Verb annotations
Introduction
This document describes the semantic annotation of 12 Dutch verbs: herroepen, heffen, huldigen, haten, herhalen, herinneren, diskwalificeren, harden, herstellen, helpen, haken, herstructureren. Both the distribution of the sense tags as attributed by anonymous annotators and their corrected versions will be presented. Before describing the schema followed by each section, some terminological clarification is in order.
Small glossary
- Majority sense
- A sense that was assigned to a token by the majority of its annotators (at least 2).
- The act of assigning such a sense is called an agreement and the annotators may be called agreeing annotators.
- When the annotators did not agree on any given sense, the majority sense is no_agreement.
- Alternative (sense)
- A sense that was assigned to a token by a minority of its annotators (only 1).
- The act of assigning such a sense is called a disagreement or disagreeing annotation and the annotator may be called disagreeing/dissenting annotator.
- Full agreement
- The case that all 3-4 annotators of a token assign the same sense.
- Geen tag
- Assignment of a “none of the above” tag. These assignments were classified as wrong_lemma, not_listed, unclear or between, based on the annotators’ comments.
- Final sense
- The sense tag assigned by us to a given token, considering but not fully relying on the majority sense and comments.
- Batch
- Set of 40 tokens of the same lemma annotated by the same group of 3-4 annotators.
- The annotators of the first batch of lemma X don’t need to match those of the first batch of lemma Y, but will normally share the same batches in four or five different verbs. In a few cases, one person could annotate two batches of the same lemma.
- Normally the Netherlandic sources are in the first batches and the Flemish, in the last ones.
- Cue
- Context word selected by an annotator as informative/helpful for assigning a sense.
- Only cues selected by agreeing annotators are considered, and only if the sense those annotators assigned also matches the final sense.
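The glossary definitions of majority sense, alternative and full agreement can be illustrated with a minimal sketch. The function name `summarize_token` and the input format (a plain list of sense tags, one per annotator) are hypothetical, and treating a 2-2 tie between four annotators as no_agreement is an assumption: the glossary does not spell out that case.

```python
from collections import Counter

NO_AGREEMENT = "no_agreement"


def summarize_token(annotations):
    """Summarize the sense tags given to one token by its 3-4 annotators.

    `annotations` is a list with one sense tag per annotator
    (hypothetical input format).
    """
    counts = Counter(annotations)
    ranked = counts.most_common()
    top_sense, top_votes = ranked[0]
    # Assumption: a tie at the top (e.g. 2 vs 2) counts as no agreement.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if top_votes < 2 or tied:
        majority = NO_AGREEMENT
    else:
        majority = top_sense
    # Alternatives are senses chosen by exactly one annotator.
    alternatives = sorted(s for s, n in counts.items() if n == 1)
    # Full agreement: every annotator chose the same sense.
    full_agreement = len(counts) == 1 and len(annotations) >= 2
    return {
        "majority": majority,
        "alternatives": alternatives,
        "full_agreement": full_agreement,
    }
```

For instance, `summarize_token(["herroepen_1", "herroepen_1", "herroepen_2"])` would report herroepen_1 as majority sense and herroepen_2 as the alternative, without full agreement.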
Schema of the descriptions
For each lemma, the following information will be discussed: original senses and annotations, final senses after a revised reading of the concordance, the most frequent dependency paths stemming from the target, lists of tokens to look for in the vector space models and whether any tokens must be removed from the concordance and why. The next paragraphs will explain in more detail what to expect in each subsection.
Original senses and annotations
First, the original definitions and examples as given to the annotators will be shown, next to their English translations.
Second, the frequency of the annotations by each annotator will be shown in a barplot, which illustrates both the distribution of the senses across batches and the level of agreement within each batch. Next to this first plot, disagreeing annotations will be shown. One general plot will summarize how many disagreements occurred in each batch, by which annotator and against which majority sense, to assess whether confusion is spread or concentrated on certain annotators.
Further plots per sense (in their respective tabs) will show which sense each annotator assigned to the tokens with a given majority sense, to get a better idea of which senses were more problematic and whether the issue was spread or concentrated on some annotators.
Final senses
After a revised reading of the concordances, disagreements may be solved and sometimes even sense tags reassigned. If one of the original senses turns out to be too infrequent or not as expected, it will be removed, and new tags may be included for tokens that don’t conform to any of the original senses, especially if the annotators reported the issue.
This section will address the final distribution: while majority sense still refers to the tag assigned by the majority of the annotators, final sense is the tag that will be used when modelling the lemma, and it might overlap to a greater or lesser degree with the original distribution.
After reporting modifications to the definitions, if any, three sections follow: “Original versus final sense distribution”, “Reliable cues” and “Most frequent dependency paths”.
Original versus final sense distribution
The first plot and table show which kind of modification was applied to each token annotation:
- majority
- The majority sense was accepted.
- correct
- The majority sense was not accepted, and another one of the original tags was assigned instead.
- new
- A new sense tag, not contemplated in the original senses, was assigned. It could either constitute a new tag at the same level as the others or be subsumed as a subsense of one of the original tags.
- idiom
- An idiomatic expression was identified. It could either work as a new tag at the same level as the others or be subsumed under one of the original tags.
- remove
- The token was removed.
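The table of modifications per majority sense can be thought of as a simple cross-tabulation. The sketch below assumes that each token is available as a hypothetical `(majority_sense, action)` pair; the function names are illustrative and do not reflect the code actually used for this report.

```python
from collections import Counter

# The five kinds of modification listed above.
ACTIONS = ("majority", "correct", "new", "idiom", "remove")


def tally_modifications(records):
    """Count actions per majority sense.

    `records` is an iterable of (majority_sense, action) pairs,
    one per token (hypothetical input format).
    """
    table = {}
    for majority_sense, action in records:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        table.setdefault(majority_sense, Counter())[action] += 1
    return table


def action_totals(table):
    """Sum the per-sense counts into one total per action."""
    totals = Counter()
    for row in table.values():
        totals.update(row)
    return totals
```

Summing the totals over all actions should give back the number of tokens of the lemma, which is a quick consistency check on the table.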
The third plot correlates original majority sense and final sense assignments, splitting between cases with and without full agreement to check if there are more corrections in the latter than in the former.
Reliable cues
A series of tables will show the top ranked cues per sense. The goal is to have an idea of which context words characterize a given sense better, among the final senses, so only annotations where the assigned sense tag matches the final sense will be taken into account. Furthermore, the individual assignments and cue selection are not particularly reliable, so a threshold of two votes per cue was set. That means that we will only count cases where at least two annotators selected a cue as such for a sense that was also the final sense. If they didn’t both choose the same sense and context word, it is ignored.
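The two-vote threshold can be sketched as follows. The record layout (`token_id`, assigned sense, selected cues), the `final_senses` mapping and the function name `reliable_cues` are all assumptions made for illustration; cues are represented by any hashable identifier of a context word.

```python
from collections import Counter, defaultdict


def reliable_cues(annotations, final_senses, min_votes=2):
    """Keep cues chosen by at least `min_votes` annotators whose sense
    tag matches the final sense of the token.

    `annotations`: iterable of (token_id, sense, cues) tuples, one per
    annotator and token (hypothetical format).
    `final_senses`: dict mapping token_id to its final sense.
    """
    votes = Counter()
    for token_id, sense, cues in annotations:
        # Ignore annotations that disagree with the final sense.
        if final_senses.get(token_id) != sense:
            continue
        # Each annotator votes at most once per cue.
        for cue in set(cues):
            votes[(token_id, cue)] += 1
    kept = defaultdict(list)
    for (token_id, cue), n in votes.items():
        if n >= min_votes:
            kept[token_id].append(cue)
    return {token_id: sorted(cues) for token_id, cues in kept.items()}
```

Tokens that do not appear in the result have no cue meeting the criteria, which corresponds to the "no cues that match these criteria" counts reported per lemma.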
One technical disclaimer is in order: when the annotation procedure started, there was a bug in the section of the annotation tool responsible for recording the cues. If the same wordform occurred more than once in the context of a given token, and one of its instances was selected as cue, only the first instance was recorded, regardless of whether it was correct or not. The bug was identified and the annotators were notified, but not all of them corrected the results.
The first table shows the top 10 lemma-part-of-speech combinations selected as cues per sense. Then, for each of the senses a dedicated table repeats the top 10 lemma-part-of-speech combinations and adds the top 10 relative positions, dependency paths and dependency path lengths (aka steps).
- Relative position
- The relative position of a context word in relation to the target is expressed as a combination of a letter (L or R) and a number (minimum 1) so that R1 is the first token to the right of the target, L2 is the second token to the left of the target, etc.
- Dependency path
- The path from each context word to the target along the dependency tree was calculated. In the formula, #T represents the target and -> the direction from head to dependent, followed by the dependency relation and the dependent separated by a colon, like in the dependency module; CW represents the selected cue. E.g.: #T->obj1:CW means that the cue is the direct object of the target; X->[mod:CW,det:#T] would mean that the target is the determiner of an item X of which the cue is a modifier.
- A description of the tags can be found here.
- The code that drew these paths is rather rough, so some weird patterns might come up (e.g. moet->vc:word->[su:CW,vc:#T] when it should’ve been moet->[su:CW,vc:word->vc:#T]), but they are a minority and it should work fine in the dependency module.
- NA values indicate that the cue is beyond the sentence, and therefore has no dependency path to the target.
- Steps (path length)
- The steps required to go from the target to the cue in the dependency tree, e.g.: 1 for #T->obj1:CW, 2 for X->[mod:CW,det:#T]…
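As an illustration of the position labels and the step counts, here is a minimal sketch. The function names are hypothetical and this is not the code that produced the tables. It assumes that a path is either linear (running from one of #T and CW to the other) or ends in a single bracketed fork with two branches, and that node labels contain no commas or brackets.

```python
def relative_position(target_index, cue_index):
    """Label a cue by side (L/R) and distance from the target,
    given the token indices of both within the context."""
    offset = cue_index - target_index
    if offset == 0:
        raise ValueError("the cue cannot be the target itself")
    side = "R" if offset > 0 else "L"
    return f"{side}{abs(offset)}"


def path_steps(path):
    """Count the dependency steps between #T and CW in a path string.

    Returns None for NA paths (cue beyond the sentence).
    """
    if path is None or path == "NA":
        return None
    if path.endswith("]") and "[" in path:
        # Fork: the steps are the edges in both branches below the
        # common head; anything before the bracket is not counted.
        branches = path[path.index("[") + 1:-1].split(",")
        return sum(1 + branch.count("->") for branch in branches)
    # Linear path: one step per arrow.
    return path.count("->")
```

Under these assumptions, `path_steps("#T->obj1:CW")` gives 1 and `path_steps("X->[mod:CW,det:#T]")` gives 2, matching the examples above. A rough pattern such as moet->vc:word->[su:CW,vc:#T] would be counted as 2 steps rather than the 3 of its corrected form.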
Most frequent dependency paths
A plot will show the frequency of dependency paths that occur in at least half the tokens of some sense. This does not filter the context words in any way, either by bag-of-words distance, part of speech or dependency links (so useless but frequent paths like #T->punct:CW show up), but it may be filtered by frequency or path length.
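The selection criterion for this plot (paths occurring in at least half the tokens of some sense) could be sketched like this. The input format, a `(final_sense, paths)` pair per token, and the function name `frequent_paths` are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def frequent_paths(tokens, min_share=0.5):
    """Return the paths found in at least `min_share` of the tokens
    of at least one sense.

    `tokens`: iterable of (final_sense, paths) pairs, where `paths`
    holds the dependency paths attested in that token
    (hypothetical format).
    """
    n_tokens = Counter()
    n_with_path = defaultdict(Counter)
    for sense, paths in tokens:
        n_tokens[sense] += 1
        # Count each path once per token, however often it occurs.
        for path in set(paths):
            n_with_path[sense][path] += 1
    selected = set()
    for sense, counts in n_with_path.items():
        for path, n in counts.items():
            if n >= min_share * n_tokens[sense]:
                selected.add(path)
    return selected
```

Because no filter on the context words is applied, a path like #T->punct:CW will typically pass this threshold as well.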
Lists
Some tokens or characteristics thereof might be interesting to look at in the vector spaces, but don’t warrant categorizing all the tokens; instead, lists are made. These lists could group attestations of the following phenomena, among others:
- Nominalizations
- the closest context words and dependency relations will be different, but the lemma still matches the target and can be assigned a sense.
- It could make it harder to distinguish between transitive and intransitive constructions.
- Garden-path tokens
- There is some deceiving context word that could trick the models into grouping the token with the wrong category.
- Atypical contexts
- The target either occurs in an atypical combination that can still be parsed, or in unexpected contexts such as lists, poetry fragments, etc.
- Headlines
- Relatively short sentences without punctuation; sometimes the division between sentences is hard to find by the annotators, and/or relevant context words are likely to be found outside (as elaborations of the headline).
- Titles
- Short independent phrases that are not separated from the rest of the context by sentence delimiters, but for which the external context is rarely helpful. This is a much bigger issue with nouns than with verbs.
- Encyclopedic knowledge needed
- Cases in which identifying and recognizing proper names in the context goes a long way toward successful disambiguation.
Removed tokens
Some tokens might be excluded from future analysis for any of the following reasons: the token actually belongs to a different lemma, it is a duplicate of another token, or it belongs to a valid category (an original or new sense, an “in between sense”) that is too infrequent (normally <1%).
HERROEPEN
Original senses and annotations
The tokens of herroepen were annotated with 2 senses in 6 batches; the tags in Table 1 were suggested.
| Definitions |
|---|
| herroepen_1 |
| (trans.) m.b.t. wetten, besluiten e.d.: intrekken, niet langer geldig verklaren: een besluit, volmacht, decreet herroepen |
| (trans.) w.r.t. laws, decisions and such: withdraw, declare not valid anymore: annul a decision, power of attorney, decree |
| herroepen_2 |
| (trans.) m.b.t. uitspraken, meningen e.d.: terugnemen en rechtzetten: Trump moest weer een van zijn dwaze tweets herroepen |
| (trans.) w.r.t. statements, opinions and such: retract and correct: Trump had to retract one of his crazy tweets again |
Figure 1 shows the sense distribution by annotator and batch and Figure 2, that of the disagreements. Figure 3 shows the sense tags that each annotator of each batch assigned to the tokens with herroepen_1 as majority sense and Figure 4 those for herroepen_2.
General distribution
Both senses seem roughly equally frequent in the first three batches, while herroepen_1 is much more frequent in the other three. There is not too much disagreement between annotators of the same batch and only three tokens without any agreement at all, which could be assigned a sense.
Figure 1. Distribution of senses of ‘herroepen’ per annotator and batch.
Figure 2. Distribution of disagreeing annotations of ‘herroepen’ per annotator and batch.
Disagreement in herroepen_1
There are few disagreements regarding herroepen_1, most of them concentrated on two particular annotators.
Figure 3. Sense annotations of tokens with ‘herroepen_1’ as majority sense.
Disagreement in herroepen_2
The second sense, herroepen_2, is much less frequent in the last batches, particularly in batch 6, and seems to have a bit more disagreement, mostly concentrated on a couple of specific annotators.
Figure 4. Sense annotations of tokens with ‘herroepen_2’ as majority sense.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified.
Original versus final sense distribution
Of the 240 tokens of herroepen, 228 kept their original majority senses, 8 were corrected to another original sense, and 4 were removed.
Table 2 shows in how many tokens with each majority sense which actions were taken, and Figure 5 illustrates the frequency of the final tags. Figure 6 correlates the original majority sense and the final senses.
Figure 5. Final distribution of senses of ‘herroepen’.
| original | correct | majority | remove |
|---|---|---|---|
| herroepen_1 | 2 | 138 | 1 |
| herroepen_2 | 2 | 90 | 2 |
| no_agreement | 3 | 0 | 0 |
| unclear | 1 | 0 | 0 |
| wrong_lemma | 0 | 0 | 1 |
Figure 6. Majority and final senses of ‘herroepen’.
Reliable cues
Table 3 shows the most frequent context words selected by the annotators as relevant. Table 4 and Table 5 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags herroepen_1 and herroepen_2.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 20 have no cues that match these criteria. 146 have one single cue and 74 have more than one (up to 10).
Across senses
The two senses have different collocates, three of which stand out by their frequency. The last columns represent a token with wrong_lemma as majority sense, (8) below, in which the annotators tagged a name as cue, presumably as indicating that the target did not belong to any of the other categories.
| Rank | herroepen_1 | n | herroepen_2 | n1 | remove | n2 |
|---|---|---|---|---|---|---|
| 1 | beslissing/noun | 30 | verklaring/noun | 17 | Verstraete/name | 1 |
| 2 | besluit/noun | 9 | uitspraak/noun | 15 | 0 | |
| 3 | veroordeling/noun | 5 | bekentenis/noun | 3 | 0 | |
| 4 | vonnis/noun | 5 | zal/verb | 3 | 0 | |
| 5 | rechtbank/noun | 4 | zeg/verb | 3 | 0 | |
| 6 | wet/noun | 4 | zijn/det | 3 | 0 | |
| 7 | decreet/noun | 3 | bewering/noun | 2 | 0 | |
| 8 | schorsing/noun | 3 | dat/det | 2 | 0 | |
| 9 | word/verb | 3 | getuige/noun | 2 | 0 | |
| 10 | afspraak/noun | 2 | het/det | 2 | 0 | |
herroepen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest six slots to the left of the target or up to three to the right, up to 3 or 4 steps away in the dependency path and mainly as direct object (#T->obj1:CW) but also as passive subject (word->[vc:#T,su:CW], word being the verb worden) of the target.
In eight cases, the cues were beyond the sentence: these are three tokens where the theme is not specified within the sentence (“Het werd nooit herroepen.”, “Zoiets kan niet worden herroepen.”, “Dat werd later half herroepen…”2) and one where it was, but some context words beyond the sentence might be considered helpful too.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | beslissing/noun | 30 | L2 | 30 | #T->obj1:CW | 69 | 1 | 86 |
| 2 | besluit/noun | 9 | L3 | 25 | word->[vc:#T,su:CW] | 19 | 2 | 55 |
| 3 | veroordeling/noun | 5 | L1 | 23 | #T->mod:van->obj1:CW | 9 | 3 | 30 |
| 4 | vonnis/noun | 5 | L4 | 20 | #T->su:CW | 9 | 4 | 18 |
| 5 | rechtbank/noun | 4 | L5 | 16 | NA | 8 | NA | 8 |
| 6 | wet/noun | 4 | L6 | 14 | #T->mod:CW | 5 | 6 | 7 |
| 7 | decreet/noun | 3 | R2 | 14 | #T->mod:door->obj1:CW | 5 | 5 | 6 |
| 8 | schorsing/noun | 3 | R3 | 13 | ben->[vc:#T,su:CW] | 4 | 8 | 5 |
| 9 | word/verb | 3 | L14 | 8 | CW->vc:#T | 3 | 7 | 3 |
| 10 | afspraak/noun | 2 | L12 | 7 | kan->vc:word->[vc:#T,su:CW] | 3 | 9 | 2 |
herroepen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the four closest slots to the left and right of the target (but not the first slot to the right?), as direct object (#T->obj1:CW) of the target or in any case one or two steps away in the dependency path. The nine cues beyond the sentence correspond to four tokens that also have some cue inside the sentence.3
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | verklaring/noun | 17 | L2 | 18 | #T->obj1:CW | 60 | 1 | 76 |
| 2 | uitspraak/noun | 15 | L1 | 14 | NA | 9 | 2 | 31 |
| 3 | bekentenis/noun | 3 | L4 | 14 | #T->mod:CW | 6 | 3 | 9 |
| 4 | zal/verb | 3 | L3 | 13 | #T->obj1:uitspraak->det:CW | 3 | NA | 9 |
| 5 | zeg/verb | 3 | R3 | 13 | CW->vc:#T | 3 | 4 | 7 |
| 6 | zijn/det | 3 | R2 | 10 | word->[vc:#T,su:CW] | 3 | 6 | 6 |
| 7 | bewering/noun | 2 | L6 | 8 | #T->mod:van->obj1:CW | 2 | 5 | 4 |
| 8 | dat/det | 2 | L5 | 7 | #T->obj1:en->cnj:CW | 2 | 7 | 2 |
| 9 | getuige/noun | 2 | L7 | 6 | #T->obj1:standpunt->mod:CW | 2 | 8 | 1 |
| 10 | het/det | 2 | L10 | 5 | CW->cnj:#T | 2 | 0 | |
Most frequent dependency paths
Figure 7 shows the most frequent dependency paths colored by sense tag. There does seem to be a preference for the passive construction (X->[vc:#T,su:CW]) for herroepen_1 and for the active one (#T->obj1:CW) for herroepen_2.
Figure 7. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (11 tokens, mostly of herroepen_1);
- garden-path tokens, as is the case of (1) through (3), of herroepen_1: the object is uitspraak, which in its meaning “utterance” is a typical object of herroepen_2, but here means “sentence” (in Court context), which has legal consequences of the herroepen_1 kind.
- atypical contexts, as is the case of (4) through (7), of herroepen_2. In the first one, it is atypical that someone retracts someone else’s statements; in the second, that someone retracts all their messages (it does not sound like a normal sort of retraction), the third one is in verse and the fourth one is a reflexive construction.
(1) 2002-05-04 mr , lcp KV Mechelen krijgt licentie BRUSSEL - Het beroepscomité herriep gisteren de uitspraak van de licentiecommissie en besliste om KV Mechelen toch zijn licentie te geven .
(2) moment van hun geboorte hun ouders nog geen Hongkongse burgers waren . Daarmee herriep het hof zijn eigen uitspraak en gaf het Peking gelijk . Juristen nemen
(3) van medeplichtigheid aan de moord op Fortuyn . De uitspraak werd dinsdag weer herroepen , nadat de LPF zich zondag ook al had gedistantieerd van eerdere beschuldigingen aan het adres van Kok en Melkert .
(4) Maar het lijkt wel of het gemak waarmee hij de vergissingen van zijn voorgangers herroept , tegelijkertijd gepaard gaat met een hang naar nieuwe tegenstrijdigheden . ’ Ik
(5) Covad , een jong Californisch bedrijf dat de DSL-verbinding daadwerkelijk tot stand brengt met een zogenoemde bridge ( een soort modem ) , moest al zijn e-mails en telefonische boodschappen die in augustus binnenstroomden , herroepen . " Viktor , we zijn hard aan het werk om jouw DSL-verbinding
(6) Welnu : ’ Het ware middelpunt van ons heelal Is niet de Aarde , doch de Zon ’ Hetgeen men toen niet maken kon ‘t Was namelijk nog lang geen Carnaval Weldra verscheen hij voor ’t Gerecht De perspectieven waren slecht Dus met het lot van Bruno in ’t verschiet Herriep hij braaf zijn ketterij’ Maar toch beweegt ze , ’ bromde hij Want schuldbewust was Galilei niet .
(7) Senaat telt iedere zetel . Op één punt zal Bush zich vandaag misschien herroepen . Algemeen werd verwacht dat hij striktere regels voor financiële verslaglegging zou afkondigen
Removed tokens
4 tokens will be removed from the concordance: (8), where the target seems to be a one-word headline, and three duplicates.
(8) toe te laten . Later viseerde men expliciet de joden . Herroepen Jan Verstraete kreeg voor zijn onderzoek niet alleen inzage in een tot op
HEFFEN
Original senses and annotations
The tokens of heffen were annotated with 2 senses in 6 batches; the tags in Table 6 were suggested.
| Definitions |
|---|
| heffen_1 |
| (trans.) m.b.t. materiële zaken: in de hoogte brengen, optillen: met geheven hoofd; hij heft met gemak 80 kilo in de hoogte |
| (trans.) w.r.t. material objects: move to a higher position, lift: lifting their head; he easily lifted 80 kg |
| heffen_2 |
| (trans.) m.b.t. geld e.d.: invorderen, eisen, opleggen: belasting, rente, accijns heffen |
| (trans.) w.r.t. money and such: collect, demand, impose: collect tax, interest, excise |
Figure 8 shows the sense distribution by annotator and batch and Figure 9, that of the disagreements. Figure 10 shows the sense tags that each annotator of each batch assigned to the tokens with heffen_1 as majority sense and Figure 11 those for heffen_2.
General distribution
The second sense seems to be consistently more frequent than the first one; there is little disagreement, with none at all in the first batch and a maximum of 8 disagreeing annotations by annotator 2 of batch 3.
There are 5 instances with no agreement: one is an instance of a compound hefplateau and the rest, of opheffen.
The 9 tokens with wrong_lemma as majority sense are instances of hebben (typos, then), opheffen and aanheffen, and the two with not_listed as majority sense instantiate a discarded sense, “adjourn”.
Figure 8. Distribution of senses of ‘heffen’ per annotator and batch.
Figure 9. Distribution of disagreeing annotations of ‘heffen’ per annotator and batch.
Final senses
The final definitions are the same as the original definitions: the one sense added based on the concordances and suggestions of the annotators, namely “adjourn”, was discarded because of its low frequency. In addition, some idiomatic expressions were identified, but they remain subordinated to heffen_1.
Original versus final sense distribution
Of the 240 tokens of heffen, 161 kept their original majority senses, none were corrected to another original sense, and 22 were removed. 57 tokens were identified as instances of some idiomatic expression.
Table 7 shows in how many tokens with each majority sense which actions were taken, and Figure 12 illustrates the frequency of the final tags. Figure 13 correlates the original majority sense and the final senses.
Figure 12. Final distribution of senses of ‘heffen’.
| original | idiom | majority | remove |
|---|---|---|---|
| heffen_1 | 57 | 21 | 4 |
| heffen_2 | 0 | 140 | 2 |
| no_agreement | 0 | 0 | 5 |
| not_listed | 0 | 0 | 2 |
| wrong_lemma | 0 | 0 | 9 |
Figure 13. Majority and final senses of ‘heffen’.
Reliable cues
Table 8 shows the most frequent context words selected by the annotators as relevant. Table 9 and Table 10 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags heffen_1 and heffen_2.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 8 have no cues that match these criteria. 161 have one single cue and 71 have more than one (up to 5).
Across senses
The most frequent cues for heffen_1 are the collocates corresponding to the idioms identified: “het glas heffen”, “de handen ten hemel heffen”, “een vinger(tje) heffen”. The two most frequent for heffen_2 are indeed very frequent, but it must be taken into account that they often occur in compounds (such as bronbelasting at the end of the table), which have lower frequencies themselves. The removed tokens do not exhibit a stable pattern of cues, which is understandable. These are mostly tokens of opheffen and aanheffen and they were not always identified by the annotators as a different lemma (sometimes as a “different sense”).
| Rank | heffen_1 | n | heffen_2 | n1 | remove | n2 |
|---|---|---|---|---|---|---|
| 1 | glas/noun | 33 | belasting/noun | 51 | op/part | 3 |
| 2 | hand/noun | 17 | tol/noun | 10 | aan/prep | 1 |
| 3 | het/det | 17 | accijns/noun | 9 | ban_vloek/noun | 1 |
| 4 | hemel/noun | 9 | te/comp | 5 | belang/noun | 1 |
| 5 | te/prep | 9 | entree/noun | 4 | controleer/verb | 1 |
| 6 | arm/noun | 6 | schenking_recht/noun | 3 | klap/noun | 1 |
| 7 | de/det | 4 | statie_geld/noun | 3 | krijg/verb | 1 |
| 8 | hun/det | 4 | successie_recht/noun | 3 | laat/verb | 1 |
| 9 | vinger/noun | 4 | boete/noun | 2 | op/prep | 1 |
| 10 | zijn/det | 4 | bron_belasting/noun | 2 | plan/noun | 1 |
heffen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 5 slots to the left and right of the target and up to two or three steps away in the dependency paths; they are mostly the object of the target but also the determiner of the objects glas and hand (in the aforementioned idioms).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | glas/noun | 33 | R2 | 24 | #T->obj1:CW | 65 | 1 | 78 |
| 2 | hand/noun | 17 | L1 | 21 | #T->obj1:glas->det:CW | 13 | 2 | 48 |
| 3 | het/det | 17 | R3 | 18 | #T->obj1:hand->det:CW | 7 | 3 | 13 |
| 4 | hemel/noun | 9 | L2 | 17 | #T->mod:CW | 6 | 4 | 6 |
| 5 | te/prep | 9 | R1 | 12 | word->[vc:#T,su:CW] | 4 | 5 | 1 |
| 6 | arm/noun | 6 | R4 | 11 | #T->mod:te->obj1:CW | 3 | 6 | 1 |
| 7 | de/det | 4 | R5 | 10 | ->[ROOT:#T,ROOT:wil->dp:CW] | 2 | 7 | 1 |
| 8 | hun/det | 4 | L3 | 7 | #T->ld:CW | 2 | 0 | |
| 9 | vinger/noun | 4 | R6 | 7 | #T->obj1:arm->det:CW | 2 | 0 | |
| 10 | zijn/det | 4 | L4 | 3 | #T->obj1:arm->mod:CW | 2 | 0 | |
heffen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to the left of the target or, to a lesser degree, to the right, as either direct object (#T->obj1:CW) or passive subject (word->[vc:#T,su:CW], word being the verb worden) of the target and up to three steps away in the dependency path. The ten cues beyond the sentence correspond to 8 tokens: in 6 of them, there are also (enough) cues inside the sentence; in another one, the same context word occurs inside and outside the sentence and the latter was likely registered by a technical mistake; and in the last one the object of heffen is a pronoun and its antecedent, the cue, is indeed beyond the sentence.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | belasting/noun | 51 | L2 | 49 | #T->obj1:CW | 85 | 1 | 92 |
| 2 | tol/noun | 10 | L1 | 44 | word->[vc:#T,su:CW] | 14 | 2 | 38 |
| 3 | accijns/noun | 9 | L3 | 11 | NA | 10 | 3 | 22 |
| 4 | te/comp | 5 | L4 | 9 | #T->mod:van->obj1:CW | 6 | NA | 10 |
| 5 | entree/noun | 4 | R3 | 7 | CW->body:#T | 5 | 4 | 7 |
| 6 | schenking_recht/noun | 3 | L5 | 6 | CW->mod:die->body:word->vc:#T | 4 | 7 | 6 |
| 7 | statie_geld/noun | 3 | R2 | 6 | #T->obj1:of->cnj:CW | 3 | 5 | 2 |
| 8 | successie_recht/noun | 3 | R1 | 5 | CW->mod:die->body:#T | 3 | 0 | |
| 9 | boete/noun | 2 | R4 | 5 | #T->su:CW | 2 | 0 | |
| 10 | bron_belasting/noun | 2 | L10 | 4 | zal->vc:word->[vc:#T,su:CW] | 2 | 0 | |
Most frequent dependency paths
Figure 14 shows the most frequent dependency paths colored by sense tag. While a direct object seems frequent in both, the presence of a determiner for said object is much more prominent for heffen_1, as well as the subject of the verb; verbs of which the target is a complement, on the other hand, are more present in heffen_2. While the passive construction was a relevant cue, the 14 instances in which the subject was tagged as cue were its only occurrences: it therefore represents only 10% of the tokens with this sense.
Figure 14. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (6 tokens, all from heffen_2);
- headlines (1 token, from heffen_2);
- atypical context ((9), where the object is missing);
- idiomatic expressions: 35 instances of “het glas heffen”, 15 of “de handen ten hemel heffen” and 7 of “de vinger heffen” or a variant thereof. All of these are considered cases of heffen_1.
(9) En dan nog : om de attributen te plaatsen , moet je ook heffen . Sinds november vorig jaar ga ik haast wekelijks en dit op eigen
Removed tokens
19 tokens will be removed because they are not instances of heffen, 2 because they instantiate another sense, namely “adjourn”, and one because it is too exceptional (“de loftrompet heffen”). Of the ones that do not match the target lemma, one is an instance of hefplateau ‘lifting platform’, while the rest are occurrences of hebben, where a typo led to wrong annotation, and opheffen or aanheffen, where the particle was not counted as part of the verb.
HULDIGEN
Original senses and annotations
The tokens of huldigen were annotated with 2 senses in 6 batches; the tags in Table 11 were suggested.
| Definitions |
|---|
| huldigen_1 |
| (trans.) iets of iem. eer bewijzen, vieren: we huldigen de uitvinder van de herbruikbare broodzak |
| (trans.) celebrate, pay homage to someone or something: we honor the inventor of the reusable bread bag |
| huldigen_2 |
| (trans.) erkennen, aankleven, toegedaan zijn: een opvatting, mening, theorie huldigen |
| (trans.) acknowledge, follow, be committed to: hold a view, an opinion, a theory |
Figure 15 shows the sense distribution by annotator and batch and Figure 16, that of the disagreements. Figure 17 shows the sense tags that each annotator of each batch assigned to the tokens with huldigen_1 as majority sense and Figure 18 those for huldigen_2.
General distribution
The senses seem to be equally frequent in the first two batches, but huldigen_2 is more frequent in the third batch and extremely infrequent in the other three. Except for batch 6, where one annotator disagreed in 7 huldigen_1 cases, at least 90% of the tokens of each batch have full agreement. Only 2 have no agreement at all: one was resolved to huldigen_1 and the other one was an instance of inhuldigen. The one case with unclear as majority sense was removed.
Figure 15. Distribution of senses of ‘huldigen’ per annotator and batch.
Figure 16. Distribution of disagreeing annotations of ‘huldigen’ per annotator and batch.
Disagreement in huldigen_1
The first sense covers a quarter of the tokens of batch 3, half of the first two batches and more than three quarters of the other three; other than the huldigen_2 suggestions of the third annotator of batch 6, there is barely any disagreement.
Figure 17. Sense annotations of tokens with ‘huldigen_1’ as majority sense.
Disagreement in huldigen_2
The second sense covers 25% to 45% of the first two batches and almost three quarters of the third, where there are some disagreements, but 10% or less of the other three (probably lect-dependent, since the last batches tend to have Flemish tokens).
Figure 18. Sense annotations of tokens with ‘huldigen_2’ as majority sense.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified.
Original versus final sense distribution
Of the 240 tokens of huldigen, 229 kept their original majority senses, 1 was corrected to another original sense, and 10 were removed.
Table 12 shows in how many tokens with each majority sense which actions were taken, and Figure 19 illustrates the frequency of the final tags. Figure 20 correlates the original majority sense and the final senses.
Figure 19. Final distribution of senses of ‘huldigen’.
| original | correct | majority | remove |
|---|---|---|---|
| huldigen_1 | 0 | 162 | 6 |
| huldigen_2 | 0 | 67 | 2 |
| no_agreement | 1 | 0 | 1 |
| unclear | 0 | 0 | 1 |
Figure 20. Majority and final senses of ‘huldigen’.
Reliable cues
Table 13 shows the most frequent context words selected by the annotators as relevant. Table 14 and Table 15 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags huldigen_1 and huldigen_2.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 15 have no cues that match these criteria. 105 have one single cue and 120 have more than one (up to 6).
Across senses
Nouns are selected as cues for both senses, although they belong to quite different domains; the prepositions als and voor also stand out for huldigen_1. The few cues for the discarded tokens can be neglected, although the first one is important: most of those tokens were instances of inhuldigen.
| Rank | huldigen_1 | n | huldigen_2 | n1 | remove | n2 |
|---|---|---|---|---|---|---|
| 1 | kampioen/noun | 15 | principe/noun | 14 | in/adj | 1 |
| 2 | als/prep | 10 | standpunt/noun | 9 | te/comp | 1 |
| 3 | voor/prep | 8 | het/det | 7 | | 0 |
| 4 | gemeente_bestuur/noun | 6 | opvatting/noun | 7 | | 0 |
| 5 | winnaar/noun | 6 | de/det | 3 | | 0 |
| 6 | goed/adj | 5 | een/det | 2 | | 0 |
| 7 | laureaat/noun | 5 | ik/pron | 2 | | 0 |
| 8 | speler/noun | 5 | mening/noun | 2 | | 0 |
| 9 | verdienstelijk/adj | 5 | van/prep | 2 | | 0 |
| 10 | word/verb | 5 | aandeelhouder_schap/noun | 1 | | 0 |
huldigen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 5 slots on either side of the target and up to 5 steps away in the dependency path; the most popular relations are the direct object (#T->obj1:CW, but also #T->obj1:en->cnj:CW, coordinated direct object), the passive subject (word->[vc:#T,su:CW] and ben->[vc:#T,su:CW]), a modifier (#T->mod:CW, mostly filled in by the prepositions als and voor, but also door for the agent of a passive construction) and the objects depending on such modifiers.
Eight cues are located beyond the sentence, in 7 tokens. In five cases, there are also (enough) cues inside the sentence; in another, the same wordform occurs outside and inside and the former was registered, while in the last one the annotators’ behaviour is difficult to explain.4
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | kampioen/noun | 15 | L1 | 32 | #T->obj1:CW | 51 | 2 | 132 |
| 2 | als/prep | 10 | R2 | 32 | word->[vc:#T,su:CW] | 38 | 1 | 85 |
| 3 | voor/prep | 8 | L2 | 31 | #T->mod:CW | 20 | 3 | 48 |
| 4 | gemeente_bestuur/noun | 6 | R1 | 29 | #T->mod:als->obj1:CW | 11 | 4 | 26 |
| 5 | winnaar/noun | 6 | L5 | 26 | #T->mod:voor->obj1:CW | 10 | 5 | 13 |
| 6 | goed/adj | 5 | R3 | 23 | #T->obj1:en->cnj:CW | 10 | NA | 8 |
| 7 | laureaat/noun | 5 | L3 | 19 | #T->mod:door->obj1:CW | 9 | 6 | 5 |
| 8 | speler/noun | 5 | R4 | 16 | ben->[vc:#T,su:CW] | 8 | 8 | 2 |
| 9 | verdienstelijk/adj | 5 | L6 | 15 | NA | 8 | | 0 |
| 10 | word/verb | 5 | L4 | 13 | CW->vc:#T | 7 | | 0 |
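The path notation used in these tables can be read mechanically. The sketch below assumes that every relation label (the text right before a colon) stands for one edge, and that the steps value is the number of edges between the target (#T) and the context word (CW). Counted this way, it reproduces the steps column of the diskwalificeren_3 table further below, but it remains an assumption about how the original values were computed; the names `relations` and `steps` are hypothetical.

```python
import re

# A relation label is the text right before a colon, and it follows
# an arrow ("->"), an opening bracket or a comma in the path string.
RELATION = re.compile(r"(?:->|\[|,)([^\s:,\[\]>]+):")


def relations(path):
    """List the dependency relation labels of a path, in order."""
    return RELATION.findall(path)


def steps(path):
    """Count the edges between target and cue (assumed definition).

    Cues outside the sentence have the path "NA" and no step count.
    """
    if path == "NA":
        return None
    return len(relations(path))


for path in ["#T->obj1:CW", "#T->mod:als->obj1:CW", "word->[vc:#T,su:CW]", "NA"]:
    print(path, relations(path) if path != "NA" else [], steps(path))
```

Under this reading, a direct object is one step away from the target, the object of a modifying preposition two, and a passive subject also two, because target and cue both hang from the auxiliary.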
huldigen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the first three slots to the right of the target or the first one to the left, up to 2 steps away in the dependency path, mainly as direct object (#T->obj1:CW, but also #T->obj1:en->cnj:CW, coordinated direct object) and sometimes subject (#T->su:CW) of the target.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | principe/noun | 14 | R3 | 28 | #T->obj1:CW | 56 | 1 | 67 |
| 2 | standpunt/noun | 9 | R2 | 23 | #T->su:CW | 9 | 2 | 39 |
| 3 | het/det | 7 | L1 | 13 | #T->obj1:principe->det:CW | 7 | 3 | 7 |
| 4 | opvatting/noun | 7 | R1 | 12 | #T->obj1:en->cnj:CW | 4 | 4 | 5 |
| 5 | de/det | 3 | R4 | 9 | #T->obj1:Eerst’_principe->mod:CW | 2 | 5 | 2 |
| 6 | een/det | 2 | R5 | 8 | #T->obj1:Patrick->mwp:CW | 2 | 6 | 1 |
| 7 | ik/pron | 2 | L2 | 6 | #T->obj1:principe->mod:CW | 2 | 7 | 1 |
| 8 | mening/noun | 2 | L3 | 5 | CW->body:#T | 2 | | 0 |
| 9 | van/prep | 2 | R6 | 4 | CW->mod:dat->body:word->vc:#T | 2 | | 0 |
| 10 | aandeelhouder_schap/noun | 1 | L10 | 2 | CW->mod:die->body:#T | 2 | | 0 |
Most frequent dependency paths
Figure 21 shows the most frequent dependency paths colored by sense tag. There seems to be a preference for passive constructions and combination with a modifier for huldigen_1, and for active constructions for huldigen_2.
Figure 21. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (1 token, from huldigen_2);
- headlines (12 tokens, mostly from huldigen_1);
- atypical context ((10), also a headline that requires encyclopedic knowledge).
(10) Bis algemeen 2003-08-12 Didier Wijnants Josse De Pauw huldigt muzikaal op je bek gaan Acteur en auteur Josse De Pauw heeft als
Removed tokens
One token, (11), will be removed because it is nonsensical and 9 more because they instantiate inhuldigen instead of the target lemma.
(11) Zelfs het Journaal berichtte over 1500 zoenende landgenoten in Scheveningen . RTL 4 huldigde in Valentijn 2004 , gepresenteerd door Irene van de Laar in debotel .
HATEN
Original senses and annotations
The tokens of haten were annotated with 2 senses in 6 batches; the tags in Table 16 were suggested.
| Definitions |
|---|
| haten_1 |
| (trans.) iem. haat toedragen, een sterk gevoel van afkeer en vijandschap t.o.v. iem. hebben: waarom haat hij mij zo? |
| (trans.) feel hatred, have a strong feeling of aversion and enmity towards someone: why does he hate me so much? |
| haten_2 |
| (trans.) iets onaangenaam, verfoeilijk, verwerpelijk vinden: hoe zou iemand de taalkunde kunnen haten? |
| (trans.) consider something unpleasant, detestable, reprehensible: how could someone hate linguistics? |
Figure 22 shows the sense distribution by annotator and batch and Figure 23, that of the disagreements. Figure 24 shows the sense tags that each annotator of each batch assigned to the tokens with haten_1 as majority sense and Figure 25 those for haten_2.
General distribution
The tokens seem to be split half and half between the senses, with some more instances of haten_1 in the first batch. There are 8 tokens with no agreement and some disagreements across all batches, but not that many.
Of the 8 tokens with no agreement, 4 were instances of the noun haat or English hate and were removed, while the rest could be assigned one of the senses.
Figure 22. Distribution of senses of ‘haten’ per annotator and batch.
Figure 23. Distribution of disagreeing annotations of ‘haten’ per annotator and batch.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified. However, a small category was added to include 16 tokens that could belong to either haten_1 or haten_2.
Original versus final sense distribution
Of the 240 tokens of haten, 207 kept their original majority senses, 6 were corrected to another original sense, and 11 were removed. 16 tokens were assigned a new sense.
Table 17 shows, for each majority sense, how many tokens underwent each action, and Figure 26 illustrates the frequency of the final tags. Figure 27 correlates the original majority sense and the final senses.
Figure 26. Final distribution of senses of ‘haten’.
| original | correct | majority | new | remove |
|---|---|---|---|---|
| haten_1 | 2 | 99 | 9 | 2 |
| haten_2 | 1 | 108 | 5 | 5 |
| no_agreement | 2 | 0 | 2 | 4 |
| unclear | 1 | 0 | 0 | 0 |
Figure 27. Majority and final senses of ‘haten’.
Reliable cues
Table 18 shows the most frequent context words selected by the annotators as relevant. Table 19 and Table 20 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags haten_1 and haten_2.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 39 have no cues that match these criteria. 93 have one single cue and 108 have more than one (up to 6).
Across senses
The most common cues are personal pronouns and worden for haten_1 and determiners and wat for haten_2. The context words selected as cues in the rightmost columns all belong to the same token: they are all the words in the sentence where the token, here the wrong lemma, occurs, namely “Eén pennentrek gomt eeuwen haat niet weg”.
| Rank | haten_1 | n | haten_2 | n1 | remove | n2 |
|---|---|---|---|---|---|---|
| 1 | hem/pron | 9 | het/det | 18 | eén/num | 1 |
| 2 | ze/pron | 9 | de/det | 11 | eeuw/noun | 1 |
| 3 | word/verb | 7 | ik/pron | 10 | gomt/noun | 1 |
| 4 | ik/pron | 5 | wat/pron | 8 | niet/adv | 1 |
| 5 | te/comp | 5 | dat/det | 7 | pennentrek/noun | 1 |
| 6 | de/det | 4 | te/comp | 4 | weg/noun | 1 |
| 7 | elkaar/pron | 4 | verlies/verb | 3 | | 0 |
| 8 | hen/pron | 4 | winkel/verb | 3 | | 0 |
| 9 | hij/pron | 4 | woord/noun | 3 | | 0 |
| 10 | je/pron | 4 | dat/comp | 2 | | 0 |
haten_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 4 slots to the left or 3 to the right of the token, up to three steps away in the dependency path (but overwhelmingly one), and mainly as direct object (#T->obj1:CW, #T->obj1:en->cnj:CW) but also as active subject (#T->su:CW) and in other roles. The fact that so many cues share a path or at least a path length but not a lemma indicates that there is quite a variety in the types that fill the popular slots.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | hem/pron | 9 | L1 | 37 | #T->obj1:CW | 69 | 1 | 99 |
| 2 | ze/pron | 9 | R1 | 32 | #T->su:CW | 15 | 2 | 47 |
| 3 | word/verb | 7 | L2 | 25 | CW->vc:#T | 7 | 3 | 20 |
| 4 | ik/pron | 5 | L3 | 25 | #T->obj1:en->cnj:CW | 6 | 4 | 7 |
| 5 | te/comp | 5 | R2 | 19 | word->[vc:#T,su:CW] | 6 | NA | 4 |
| 6 | de/det | 4 | L4 | 10 | CW->body:#T | 5 | 7 | 3 |
| 7 | elkaar/pron | 4 | L5 | 6 | NA | 4 | 5 | 2 |
| 8 | hen/pron | 4 | L6 | 5 | #T->mod:CW | 3 | | 0 |
| 9 | hij/pron | 4 | R3 | 5 | #T->mod:door->obj1:CW | 3 | | 0 |
| 10 | je/pron | 4 | L8 | 3 | #T->obj1:vader->det:CW | 2 | | 0 |
haten_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 3 slots on any side of the target, up to three steps away in the dependency path (but overwhelmingly one), and mainly as direct object (#T->obj1:CW, #T->obj1:en->cnj:CW) but also as active subject (#T->su:CW) and in other roles. Here as well, there is a wide variety of lemmas that can fill these slots.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | het/det | 18 | R1 | 58 | #T->obj1:CW | 88 | 1 | 119 |
| 2 | de/det | 11 | R2 | 37 | #T->su:CW | 16 | 2 | 46 |
| 3 | ik/pron | 10 | L1 | 31 | CW->body:#T | 8 | 3 | 18 |
| 4 | wat/pron | 8 | L3 | 15 | #T->obj1:en->cnj:CW | 5 | 4 | 6 |
| 5 | dat/det | 7 | L2 | 14 | ->[ROOT:#T,ROOT:CW] | 2 | 5 | 5 |
| 6 | te/comp | 4 | R3 | 12 | #T->dp:CW | 2 | 6 | 2 |
| 7 | verlies/verb | 3 | R4 | 6 | #T->mod:CW | 2 | NA | 2 |
| 8 | winkel/verb | 3 | L4 | 5 | #T->mod:om->body:te->body:CW | 2 | 7 | 1 |
| 9 | woord/noun | 3 | R5 | 5 | #T->obj1:wereld->det:CW | 2 | | 0 |
| 10 | dat/comp | 2 | L5 | 4 | #T->obj1:woord->det:CW | 2 | | 0 |
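The position labels in these tables combine the side of the target on which a cue occurs (L or R) with its distance in slots. A minimal sketch of that labelling is given below; the function name `position_label` is hypothetical, and whether punctuation marks count as slots in the original data is not stated here, so the example simply leaves them out. The tokens come from the example in the definition of haten_2.

```python
def position_label(target_index, cue_index):
    """Label a cue by side and distance from the target, e.g. 'L2' or 'R1'."""
    offset = cue_index - target_index
    if offset == 0:
        raise ValueError("the cue cannot be the target itself")
    side = "L" if offset < 0 else "R"
    return f"{side}{abs(offset)}"


# Example sentence from the definition of haten_2, without punctuation.
tokens = ["hoe", "zou", "iemand", "de", "taalkunde", "kunnen", "haten"]
target = tokens.index("haten")
labels = {
    word: position_label(target, i)
    for i, word in enumerate(tokens)
    if i != target
}
print(labels)
```

Here the direct object taalkunde is two slots to the left of the target and its determiner de three; in a main clause with the verb in second position the same object would typically receive an R label instead.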
Most frequent dependency paths
Figure 28 shows the most frequent dependency paths colored by sense tag. The profiles of each sense are quite similar, especially if we discard the punctuation; it does seem that a lower number of haten_1 tokens has no subject.
Figure 28. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- garden-path tokens ((12) and (13), from haten_1 and haten_2 respectively);
- headlines (2 tokens, from haten_2);
- title (1 token, from haten_1).
(12) We schieten enkel het overtollige wild . De sperwer wordt wel gehaat door duivenliefhebbers . Voor sperwers zijn jonge duifjes een makkelijke prooi .
(13) Alles daarbuiten is voor mij valse en absurde Kunst , namaak ; ik haat de Leys en de Lies , de Tissots en de Comtes met hun valse naïviteit , hun onechte couleur locale en hun gewaden van zijde en gouddraad … "
Removed tokens
11 tokens will be removed: 2 are instances of the English hate, 5 of the noun haat, and 4 are partial duplicates: the same sentence, “Wat haat je?”, is repeated in different contexts that could only be distinguished by bag-of-words models without sentence boundaries, and barely at that.
DISKWALIFICEREN
Original senses and annotations
The tokens of diskwalificeren were annotated with 3 senses in 6 batches; the tags in Table 21 were suggested.
| Definitions |
|---|
| diskwalificeren_1 |
| (trans.) ongeschikt verklaren en uitsluiten van een bepaalde functie of positie: een getuige diskwalificeren |
| (trans.) declare unsuitable and exclude from a certain function or position: disqualify a witness |
| diskwalificeren_2 |
| (trans.) wegens onregelmatigheden uitsluiten bij een wedstrijd: FC De Trappers werd gediskwalificeerd wegens wangedrag |
| (trans.) exclude from a competition because of irregularities: FC De Trappers was disqualified because of misbehaviour |
| diskwalificeren_3 |
| (reflex.) zichzelf buiten spel zetten, zich onmogelijk maken: met zulk gedrag diskwalificeer je jezelf |
| (reflex.) exclude oneself, make oneself impossible: with such behaviour you disqualify yourself |
Figure 29 shows the sense distribution by annotator and batch and Figure 30, that of the disagreements. Figure 31 shows the sense tags that each annotator of each batch assigned to the tokens with diskwalificeren_1 as majority sense, Figure 32 those for diskwalificeren_2 and Figure 33 for diskwalificeren_3.
General distribution
The second reading, diskwalificeren_2, is the most frequent one, especially in the last two batches (as expected: the sources in those batches are the Belgian newspapers, which tend to have more sports articles); the third sense is the least frequent. There is little disagreement, with the most dissenting annotators disagreeing in only five instances. Only one token presents no agreement at all, and it could be tagged as diskwalificeren_2, while the one with unclear as majority sense was removed.
Figure 29. Distribution of senses of ‘diskwalificeren’ per annotator and batch.
Figure 30. Distribution of disagreeing annotations of ‘diskwalificeren’ per annotator and batch.
Disagreement in diskwalificeren_1
The first sense covers about 20%-50% of each batch, with few disagreements: none in batch 3 and only 5 in batch 2, where it was disputed the most. The alternative annotations include every other possible tag.
Figure 31. Sense annotations of tokens with ‘diskwalificeren_1’ as majority sense.
Disagreement in diskwalificeren_2
The second sense covers about 50%-80% of each batch, mostly in the last two batches. There are few disagreements, mostly with diskwalificeren_1 as alternative and occasionally with unclear or the third reading.
Figure 32. Sense annotations of tokens with ‘diskwalificeren_2’ as majority sense.
Disagreement in diskwalificeren_3
The third sense covers 1 to 7 tokens of each batch, mostly with diskwalificeren_1 (the other non sport-related reading) as alternative, in spite of the different argument structure (this one is reflexive instead of transitive).
Figure 33. Sense annotations of tokens with ‘diskwalificeren_3’ as majority sense.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified.
Original versus final sense distribution
Of the 240 tokens of diskwalificeren, 230 kept their original majority senses, 8 were corrected to another original sense, and 2 were removed.
Table 22 shows, for each majority sense, how many tokens underwent each action, and Figure 34 illustrates the frequency of the final tags. Figure 35 correlates the original majority sense and the final senses.
Figure 34. Final distribution of senses of ‘diskwalificeren’.
| original | correct | majority | remove |
|---|---|---|---|
| diskwalificeren_1 | 2 | 64 | 0 |
| diskwalificeren_2 | 3 | 145 | 1 |
| diskwalificeren_3 | 2 | 21 | 0 |
| no_agreement | 1 | 0 | 0 |
| unclear | 0 | 0 | 1 |
Figure 35. Majority and final senses of ‘diskwalificeren’.
Reliable cues
Table 23 shows the most frequent context words selected by the annotators as relevant. Table 24, Table 25 and Table 26 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags diskwalificeren_1, diskwalificeren_2 and diskwalificeren_3.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 29 have no cues that match these criteria. 78 have one single cue and 133 have more than one (up to 6).
Across senses
The most frequent cues of diskwalificeren_1 are the preposition als and nouns of the domain of politics, while those of diskwalificeren_2 belong mostly to the domain of sports, including the expression “valse start” (a common cause of disqualification in a sports race). The most frequent ones for diskwalificeren_3, the reflexive reading, are of course zich and zichzelf.
| Rank | diskwalificeren_1 | n | diskwalificeren_2 | n1 | diskwalificeren_3 | n2 |
|---|---|---|---|---|---|---|
| 1 | als/prep | 9 | vals/adj | 11 | zich/pron | 14 |
| 2 | kandidaat/noun | 3 | start/noun | 10 | zichzelf/pron | 7 |
| 3 | partij/noun | 3 | finale/noun | 8 | met/prep | 1 |
| 4 | politiek/adj | 3 | meter/noun | 6 | uitlating/noun | 1 |
| 5 | gesprekspartner/noun | 2 | winnaar/noun | 6 | ze/pron | 1 |
| 6 | oud/adj | 2 | finish/noun | 5 | zijn/det | 1 |
| 7 | afgevaardigen/noun | 1 | wedstrijd/noun | 5 | | 0 |
| 8 | als/comparative | 1 | atleet/noun | 4 | | 0 |
| 9 | argument/noun | 1 | kampioenschap/noun | 4 | | 0 |
| 10 | behoefte/noun | 1 | olympisch/adj | 4 | | 0 |
diskwalificeren_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to either side of the target, up to four steps away in the dependency path and mainly as direct object (#T->obj1:CW, #T->obj1:en->cnj:CW). The long path in the sixth row corresponds to two coordinated items in the same token, partijleden and kiezers in “…gediskwalificeerd in de ogen van veel partijleden en kiezers”.
The six cues beyond sentence boundaries belong to three tokens; in two of them, there are also (enough) cues inside the sentence, while in the third one the target occurs in a short “sentence” (“Bij voorbaat gediskwalificeerd zijn:”) followed by enumerated items.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | als/prep | 9 | R3 | 10 | #T->obj1:CW | 12 | 2 | 20 |
| 2 | kandidaat/noun | 3 | L2 | 9 | NA | 6 | 1 | 19 |
| 3 | partij/noun | 3 | L6 | 9 | #T->mod:CW | 5 | 3 | 19 |
| 4 | politiek/adj | 3 | R2 | 9 | #T->mod:als->obj1:CW | 4 | 4 | 12 |
| 5 | gesprekspartner/noun | 2 | R1 | 8 | #T->obj1:en->cnj:CW | 3 | NA | 6 |
| 6 | oud/adj | 2 | L3 | 7 | ->[ROOT:#T,ROOT:zal->dp:in->obj1:oog->mod:van->obj1:en->cnj:CW] | 2 | 6 | 5 |
| 7 | afgevaardigen/noun | 1 | R4 | 6 | #T->mod:van->obj1:CW | 2 | 7 | 4 |
| 8 | als/comparative | 1 | L4 | 5 | #T->mod:wegens->obj1:besef->mod:CW | 2 | 5 | 3 |
| 9 | argument/noun | 1 | L5 | 5 | ben->[vc:#T,su:CW] | 2 | 9 | 2 |
| 10 | behoefte/noun | 1 | L8 | 5 | word->[vc:#T,dp:als->obj1:en->cnj:CW] | 2 | 10 | 2 |
diskwalificeren_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the five closest slots to the left, the third closest slot to the right and the seventh and eighth closest slots to either side of the target. They can be up to six steps away in the dependency path, but not so frequently one step away, and very often (in 50 tokens, of which 18 only have such cues) beyond the sentence boundary.
The most frequent paths among these cues are the passive subject (word->[vc:#T,su:CW], ben->[vc:#T,su:CW]), the object linked through prepositions (#T->mod:wegens->obj1:CW and cases with in, na, bij, tijdens…), and the direct object (#T->obj1:CW).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | vals/adj | 11 | L7 | 21 | NA | 82 | NA | 82 |
| 2 | start/noun | 10 | R3 | 20 | word->[vc:#T,su:CW] | 12 | 2 | 64 |
| 3 | finale/noun | 8 | L1 | 19 | #T->mod:wegens->obj1:CW | 11 | 3 | 53 |
| 4 | meter/noun | 6 | L2 | 18 | #T->obj1:CW | 9 | 4 | 41 |
| 5 | winnaar/noun | 6 | R8 | 18 | #T->mod:in->obj1:CW | 8 | 5 | 33 |
| 6 | finish/noun | 5 | L8 | 17 | #T->mod:na->obj1:CW | 8 | 6 | 20 |
| 7 | wedstrijd/noun | 5 | L4 | 16 | #T->mod:bij->obj1:CW | 5 | 1 | 15 |
| 8 | atleet/noun | 4 | L3 | 15 | ben->[vc:#T,su:CW] | 5 | 7 | 11 |
| 9 | kampioenschap/noun | 4 | L5 | 15 | #T->mod:na->obj1:start->mod:CW | 4 | 8 | 8 |
| 10 | olympisch/adj | 4 | R5 | 15 | #T->mod:tijdens->obj1:CW | 4 | 9 | 5 |
diskwalificeren_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the five closest slots to the left and the first to the right of the token, mainly one step away as the direct object (#T->obj1:CW). Even though there is a dedicated dependency tag for reflexive objects (se), it was not used for this verb.
The long path in the second row corresponds to the link between zich and the target lemma in “Als de kiezer dat zou moeten beoordelen, welke Kamer (of partij) diskwalificeert zich dan in de ogen van de kiezer?”, and is the result of a parsing error.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | zich/pron | 14 | L2 | 7 | #T->obj1:CW | 20 | 1 | 22 |
| 2 | zichzelf/pron | 7 | L5 | 4 | ->ROOT:als->body:zal->su:kiezer->mod:welk->[body:#T,ROOT:CW] | 1 | 2 | 1 |
| 3 | met/prep | 1 | R1 | 4 | #T->mod:CW | 1 | 3 | 1 |
| 4 | uitlating/noun | 1 | L1 | 3 | #T->mod:met->obj1:CW | 1 | 6 | 1 |
| 5 | ze/pron | 1 | L3 | 3 | #T->mod:met->obj1:uitlating->det:CW | 1 | | 0 |
| 6 | zijn/det | 1 | L4 | 1 | #T->su:CW | 1 | | 0 |
| 7 | | 0 | L8 | 1 | | 0 | | 0 |
| 8 | | 0 | R2 | 1 | | 0 | | 0 |
| 9 | | 0 | R3 | 1 | | 0 | | 0 |
| 10 | | 0 | | 0 | | 0 | | 0 |
Most frequent dependency paths
Figure 36 shows the most frequent dependency paths colored by sense tag. Passive constructions and verbs of which the target is a complement are preferred by diskwalificeren_2, while direct objects are preferred by the other two.
Figure 36. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (1 token from diskwalificeren_1);
- garden-path: (14) and (15), of diskwalificeren_1 and diskwalificeren_2 respectively. The former has a sport context but talks about the prestige of clubs rather than participation in a competition, while the latter co-occurs with zich but is not reflexive.
- headlines (1 token from diskwalificeren_1);
- atypical context (4 tokens of diskwalificeren_2, which are lists of results from competitions, but also (16) of diskwalificeren_1 with a missing object);
- encyclopedic knowledge necessary to disambiguate: (17) and (18), where it is necessary to recognize the names of the chess players and Formula 1 racers to know that it is a sport context (diskwalificeren_2);
- metalinguistic use (2 tokens, from diskwalificeren_2, in which the target explains an abbreviation).
(14) een schandalig voorstel . PSV verheft zichzelf boven de rest in Nederland en diskwalificeert een club als Vitesse door te praten over een Mickey Mouse-competitie . We
(15) Nog afgezien van zijn misdragingen buiten de ring iemand die tijdens een gevecht tot twee keer toe zijn tegenstander in het oor bijt omdat hij op punten dreigt te verliezen , en zich zo doelbewust laat diskwalificeren , zo iemand heeft gewoonweg geen heart . Misschien is boksen tegenwoordig weer
(16) tillen we de zaak omdat we daarin wat slimmer zijn geworden . Ik diskwalificeer niet , maar laat ik zeggen dat we al wat langer boekhouden dan de Spanjaard . "
(17) vliegtuig te zitten . Paniek bij de organisatie , Georgiev dreigde gediskwalificeerd te worden en in zijn plaats zou Sijbrands de hand mogen schudden van Van Vollenhoven .
(18) een brief naar de commissarissen schreef . Met het verzoek beide McLarens te diskwalificeren . " Zoiets kan gewoon niet , " liet Dennis zich ontvallen .
Removed tokens
2 tokens will be removed: one because the context is not enough to disambiguate, and the other one because it is a duplicate of another token.
HERSTRUCTUREREN
Original senses and annotations
The tokens of herstructureren were annotated with 3 senses in 6 batches; the tags in Table 27 were suggested.
| Definitions |
|---|
| herstructureren_1 |
| (trans.) reorganiseren, een nieuwe structuur geven: je kunt deze tekst maar beter herstructureren |
| (trans.) reorganize, give a new structure: you should restructure this text |
| herstructureren_2 |
| (trans.) m.b.t. bedrijven in problemen: activiteiten of personeel afstoten, downsizen: Bayer herstructureert zijn plasticdivisie |
| (trans.) w.r.t. businesses in difficulties: remove activities or personnel, downsize: Bayer restructures its plastic division |
| herstructureren_3 |
| (intrans.) van bedrijven in problemen: activiteiten of personeel afstoten, downsizen: de chemie moet zich herstructureren |
| (intrans.) of businesses in difficulties: remove activities or personnel, downsize: chemistry must restructure (itself) |
Figure 37 shows the sense distribution by annotator and batch and Figure 38, that of the disagreements. Figure 39 shows the sense tags that each annotator of each batch assigned to the tokens with herstructureren_1 as majority sense, Figure 40 for those of herstructureren_2 and Figure 41 for herstructureren_3.
General distribution
The sense distribution is anything but stable, both between and within batches. In some batches, as many as 10% of the tokens have no agreement at all and some annotators dissent in about 50% of their annotations, but most of the disagreement concerns tokens with either herstructureren_2 or herstructureren_3 (the “business” readings) as majority sense.
A total of 17 (7.08%) tokens have no agreement, but none have a geen majority sense. They could all be retagged to one of the senses.
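As a reminder of how such tokens arise, the glossary's definition of the majority sense can be sketched as below. The function name `majority_sense` is hypothetical, and treating a 2-2 tie between four annotators as no_agreement is an assumption, since the glossary does not cover that case explicitly.

```python
from collections import Counter


def majority_sense(tags, min_votes=2):
    """Return the tag chosen by at least `min_votes` annotators, or
    'no_agreement' when no single tag reaches that threshold.

    A tie for first place also counts as no agreement (an assumption).
    """
    if not tags:
        raise ValueError("a token needs at least one annotation")
    ranked = Counter(tags).most_common()
    top_tag, top_votes = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    if top_votes < min_votes or tied:
        return "no_agreement"
    return top_tag


print(majority_sense(["herstructureren_2", "herstructureren_3", "herstructureren_2"]))
print(majority_sense(["herstructureren_1", "herstructureren_2", "herstructureren_3"]))

# Share of tokens without agreement, as reported above.
share_no_agreement = round(100 * 17 / 240, 2)
print(share_no_agreement)
```

The first call returns the shared tag and the second returns no_agreement, which is the situation of the 17 tokens mentioned above.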
Figure 37. Distribution of senses of ‘herstructureren’ per annotator and batch.
Figure 38. Distribution of disagreeing annotations of ‘herstructureren’ per annotator and batch.
Disagreement in herstructureren_1
This reading covers about 10%-30% of each batch, with some alternative annotations of the other senses, especially from annotator 3 of batch 1 and annotator 2 of batch 3.
Figure 39. Sense annotations of tokens with ‘herstructureren_1’ as majority sense.
Disagreement in herstructureren_2
This reading covers 25%-50% of each batch, although a large portion of these tokens received the intransitive counterpart as alternative, particularly from certain annotators, and some others received the other transitive tag.
Figure 40. Sense annotations of tokens with ‘herstructureren_2’ as majority sense.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified.
Original versus final sense distribution
Of the 240 tokens of herstructureren, 165 kept their original majority senses, 75 were corrected to another original sense, and none were removed.
Table 28 shows, for each majority sense, how many tokens underwent each action, and Figure 42 illustrates the frequency of the final tags. Figure 43 correlates the original majority sense and the final senses.
Figure 42. Final distribution of senses of ‘herstructureren’.
| original | correct | majority |
|---|---|---|
| herstructureren_1 | 6 | 39 |
| herstructureren_2 | 25 | 68 |
| herstructureren_3 | 27 | 58 |
| no_agreement | 17 | 0 |
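To make the extent of the corrections easier to compare across senses, the counts in the table above can be turned into proportions. The sketch below is only a reading aid; the names `table` and `kept_share` are illustrative assumptions.

```python
# Rows of the table above: majority sense -> (corrected, kept as majority).
# Variable and function names are illustrative assumptions.
table = {
    "herstructureren_1": (6, 39),
    "herstructureren_2": (25, 68),
    "herstructureren_3": (27, 58),
    "no_agreement": (17, 0),
}


def kept_share(corrected, kept):
    """Proportion of tokens whose majority sense became the final sense."""
    return kept / (corrected + kept)


shares = {sense: kept_share(*row) for sense, row in table.items()}
total_kept = sum(kept for _, kept in table.values())
total_tokens = sum(corrected + kept for corrected, kept in table.values())
overall = total_kept / total_tokens

for sense, share in shares.items():
    print(f"{sense}: {share:.1%}")
print(f"overall: {overall:.1%}")
```

The majority sense was kept in roughly 87% of the herstructureren_1 tokens, against about 73% and 68% for the two “business” readings, and in about 69% of all tokens.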
Figure 43. Majority and final senses of ‘herstructureren’.
Reliable cues
Table 29 shows the most frequent context words selected by the annotators as relevant. Table 30, Table 31 and Table 32 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags herstructureren_1, herstructureren_2 and herstructureren_3.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 85 have no cues that match these criteria. 68 have one single cue and 87 have more than one (up to 10).
Across senses
As would be expected, the lemmas that were selected as cues for herstructureren_1 are different from those in the other two senses, which share bedrijf and baan. However, they are very infrequent: this could be due to the variety of lemmas, but also to the amount of disagreement between the annotators, which lowers the chances of agreement in both sense tag and cue selection. Other than bedrijf as cue for the “business” senses, two lemmas stand out that actually represent syntactic constructions: the verb worden for herstructureren_2 and the particle te for herstructureren_3.
| Rank | herstructureren_1 | n | herstructureren_2 | n1 | herstructureren_3 | n2 |
|---|---|---|---|---|---|---|
| 1 | schuld/noun | 3 | word/verb | 14 | te/comp | 8 |
| 2 | het/det | 2 | bedrijf/noun | 9 | bedrijf/noun | 7 |
| 3 | kruispunt/noun | 2 | te/comp | 9 | om/comp | 5 |
| 4 | aantal/noun | 1 | het/det | 7 | moet/verb | 4 |
| 5 | administratie/noun | 1 | zijn/det | 4 | zich/pron | 4 |
| 6 | bedrijf_terrein/noun | 1 | activiteit/noun | 3 | ben/verb | 3 |
| 7 | belang/noun | 1 | moet/verb | 3 | dat/comp | 3 |
| 8 | boek/noun | 1 | afdeling/noun | 2 | het/det | 3 |
| 9 | Bornem_centrum/noun | 1 | baan/noun | 2 | baan/noun | 2 |
| 10 | choreografie/noun | 1 | de/det | 2 | fabriek/noun | 2 |
herstructureren_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to either side of the target, up to three steps away in the dependency path, and as direct object (#T->obj1:CW) of the target, but also as passive subject and in other relations.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | schuld/noun | 3 | L2 | 13 | #T->obj1:CW | 18 | 1 | 22 |
| 2 | het/det | 2 | L1 | 6 | word->[vc:#T,su:CW] | 5 | 3 | 13 |
| 3 | kruispunt/noun | 2 | R2 | 6 | #T->mod:van->obj1:CW | 3 | 2 | 11 |
| 4 | aantal/noun | 1 | R3 | 5 | moet->vc:word->[vc:#T,su:CW] | 2 | 4 | 4 |
| 5 | administratie/noun | 1 | L6 | 4 | ->[ROOT:#T,ROOT:wil->dp:CW] | 1 | 5 | 3 |
| 6 | bedrijf_terrein/noun | 1 | L3 | 3 | #T->det:CW | 1 | 6 | 3 |
| 7 | belang/noun | 1 | L4 | 3 | #T->mod:CW | 1 | 7 | 1 |
| 8 | boek/noun | 1 | L11 | 2 | #T->mod:om->body:te->body:word->predc:CW | 1 | 9 | 1 |
| 9 | Bornem_centrum/noun | 1 | L12 | 2 | #T->mod:om->obj1:CW | 1 | 10 | 1 |
| 10 | choreografie/noun | 1 | R1 | 2 | #T->mod:tot->body:CW | 1 | | 0 |
herstructureren_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest six or seven slots to the left of the target and the first to the right, up to 3 or 4 steps in the dependency path, mainly as direct object (#T->obj1:CW) of the target or verb of which the target is a complement (CW->vc:#T, mostly worden but also hebben and moeten) but also in the construction “te herstructureren” (CW->body:#T) or as passive subject (word->[vc:#T,su:CW]).
The five cues beyond the sentence correspond to three tokens: in two of them, banen and verdwijnen are indeed good indicators of the “business” readings but they occur in a different sentence from the target; in the third one, for no apparent reason, two annotators agreed both on the sense and on a context word from a different sentence that is not related to the target.5
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | word/verb | 14 | L2 | 34 | #T->obj1:CW | 36 | 1 | 68 |
| 2 | bedrijf/noun | 9 | L1 | 32 | CW->vc:#T | 15 | 2 | 38 |
| 3 | te/comp | 9 | L3 | 20 | CW->body:#T | 9 | 3 | 16 |
| 4 | het/det | 7 | L4 | 9 | word->[vc:#T,su:CW] | 7 | 4 | 11 |
| 5 | zijn/det | 4 | L5 | 7 | NA | 5 | NA | 5 |
| 6 | activiteit/noun | 3 | L6 | 6 | #T->su:CW | 4 | 5 | 4 |
| 7 | moet/verb | 3 | R1 | 6 | #T->mod:CW | 2 | 6 | 2 |
| 8 | afdeling/noun | 2 | L7 | 5 | CW->cnj:#T | 2 | 9 | 2 |
| 9 | baan/noun | 2 | L8 | 4 | CW->vc:word->vc:#T | 2 | 12 | 1 |
| 10 | de/det | 2 | R2 | 4 | word->vc:of->[cnj:#T,su:CW] | 2 | | 0 |
herstructureren_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest seven slots to the left of the target, up to three or four steps away in the dependency path. The most frequent path between a cue and the target seems to be one where they are both the root of the sentence. This occurs in 9 different sentences that must have confused the automatic parser; the wordforms of these cues are: ABX, dat, verklaarde, verkocht, te, is, aan, het, bedrijven, Hyperport, Palm.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | te/comp | 8 | L1 | 27 | ->[ROOT:#T,ROOT:CW] | 12 | 2 | 37 |
| 2 | bedrijf/noun | 7 | L2 | 17 | #T->obj1:CW | 9 | 3 | 28 |
| 3 | om/comp | 5 | L3 | 16 | CW->body:#T | 8 | 1 | 27 |
| 4 | moet/verb | 4 | L5 | 13 | CW->mod:om->body:te->body:#T | 5 | 4 | 13 |
| 5 | zich/pron | 4 | L4 | 10 | NA | 5 | 5 | 8 |
| 6 | ben/verb | 3 | L6 | 8 | CW->body:te->body:#T | 4 | NA | 5 |
| 7 | dat/comp | 3 | L7 | 6 | CW->vc:#T | 3 | 6 | 2 |
| 8 | het/det | 3 | R1 | 3 | en->[cnj:#T,cnj:CW] | 3 | 8 | 2 |
| 9 | baan/noun | 2 | R2 | 3 | #T->su:CW | 2 | 7 | 1 |
| 10 | fabriek/noun | 2 | R3 | 3 | ben->vc:aan->[body:#T,su:CW] | 2 | 9 | 1 |
Most frequent dependency paths
Figure 44 shows the most frequent dependency paths colored by sense tag. The only paths that seem to occur in at least half the tokens of some sense are the punctuation mark, the direct object and the modifier, which are dispreferred by herstructureren_3.
Figure 44. Tokens per path.
Tracking lists
- nominalizations (10 tokens, mostly from herstructureren_1 but also from the other senses);
- headlines (9 tokens, from all senses);
- atypical context (1 token of herstructureren_3 with an atypical object, namely leger);
- encyclopedic knowledge necessary to disambiguate ((19) and (20), where knowing what NDF, VWS and INDA stand for helps select herstructureren_1);
- the object is zich(zelf) and variations (10 cases, half annotated as herstructureren_1 and the other half as herstructureren_3, because the example in the original definition was reflexive).
- krijgen , gaat er aan de andere kant vanaf . ’ De NDF herstructureerde zich op last van VWS . Die reorganisatie was een geweldig karwei .
- Minister van Cultuur Giovanna Melandri onderstreepte in een reactie op de arrestaties dat het Inda in 1998 is geherstructureerd , met onder andere verandering van de voltallige leiding .
Removed tokens
No tokens of herstructureren will be removed.
HERINNEREN
Original senses and annotations
The tokens of herinneren were annotated with 3 senses in 6 batches; the tags in Table 33 were suggested.
| Definitions |
|---|
| herinneren_1 |
| (met ‘aan’) weer te binnen brengen, in het geheugen terugroepen: iemand aan iets herinneren |
| (with aan ‘of’) bring back to the mind, to the memory: remind someone of something |
| herinneren_2 |
| (reflex.) in het geheugen aanwezig hebben, niet vergeten: zich een gebeurtenis, een persoon herinneren |
| (reflex.) have present in the memory, not forget: remember an event, a person |
| herinneren_3 |
| (trans.) met een plechtigheid, monument o.i.d. gedenken: we herinneren vandaag de Slag bij Ronceval |
| (trans.) remember with a celebration, monument and such: today we remember the Battle of Roncevaux Pass |
Figure 45 shows the sense distribution by annotator and batch and Figure 46, that of the disagreements. Figure 47 shows the sense tags that each annotator of each batch assigned to the tokens with herinneren_1 as majority sense, and Figure 48 those for herinneren_2, while herinneren_3 was too infrequent to require a plot.
General distribution
The second sense is always the most frequent and the third the least frequent, the latter rarely reaching any agreement at all. The distribution across annotators within batches is relatively stable, and everyone disagrees in at most 10% of their annotations, except for the first annotator of batch 4, who disagrees in almost 50% of the cases. There are only 2 cases with no agreement at all, both in batch 2; they were both assigned herinneren_3. No tokens had a geen tag as majority sense.
Figure 45. Distribution of senses of ‘herinneren’ per annotator and batch.
Figure 46. Distribution of disagreeing annotations of ‘herinneren’ per annotator and batch.
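The disagreement rates discussed above follow from the glossary definitions of majority sense and disagreeing annotation. The sketch below shows one way to compute them. It assumes that a 2-2 tie among four annotators counts as no_agreement and that tokens without agreement are left out of the rates; neither choice is stated in this document. The function names and the toy batch are invented for illustration.

```python
from collections import Counter


def majority_sense(tags):
    """Majority sense of one token given the tags of its 3-4 annotators."""
    counts = Counter(tags).most_common()
    top_tag, top_n = counts[0]
    # Assumption: a tie for first place (e.g. 2-2) counts as no agreement.
    tied = len(counts) > 1 and counts[1][1] == top_n
    if top_n < 2 or tied:
        return "no_agreement"
    return top_tag


def disagreement_rates(batch):
    """Share of each annotator's tags that differ from the majority sense.

    `batch` maps a token id to {annotator: tag}. Tokens without agreement
    are skipped (an assumption), since there is no majority to dissent from.
    """
    total, dissent = Counter(), Counter()
    for annotations in batch.values():
        majority = majority_sense(list(annotations.values()))
        if majority == "no_agreement":
            continue
        for annotator, tag in annotations.items():
            total[annotator] += 1
            dissent[annotator] += tag != majority
    return {a: dissent[a] / total[a] for a in total}


# Invented toy batch, not data from the annotation.
toy_batch = {
    "t1": {"A": "herinneren_2", "B": "herinneren_2", "C": "herinneren_1"},
    "t2": {"A": "herinneren_1", "B": "herinneren_1", "C": "herinneren_1"},
    "t3": {"A": "herinneren_1", "B": "herinneren_2", "C": "herinneren_3"},
}
print(disagreement_rates(toy_batch))
```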
Disagreement in herinneren_1
This sense covers about 20%-50% of each batch; in each batch one annotator disagrees with a number of annotations, suggesting either herinneren_2 or herinneren_3 as alternative.
Figure 47. Sense annotations of tokens with ‘herinneren_1’ as majority sense.
Disagreement in herinneren_2
This sense covers at least half the tokens of each batch, and once beyond three quarters. There are few disagreeing annotations, with the remarkable outlier of annotator one in batch 4, who suggested herinneren_1 as alternative for about half their annotations.
Figure 48. Sense annotations of tokens with ‘herinneren_2’ as majority sense.
Disagreement in herinneren_3
There are only two tokens with this as majority sense: one in batch 1, with herinneren_1 as alternative, and one in batch 4, with herinneren_2 as alternative. The former actually was retagged as herinneren_1, while the latter does match herinneren_3.
Final senses
One definition, the one for herinneren_3, differs from the original one, based on the actual occurrences of the corpus, so that the final senses are the ones in Table 34. Still, that new reading is extremely infrequent.
| code | Definition |
|---|---|
| herinneren_1 | (with aan ‘of’) bring back to the mind, to the memory |
| herinneren_2 | (reflex.) have present in the memory, not forget |
| herinneren_3 | (trans.) in the construction “herinnerd worden als”, keep in the collective memory |
Original versus final sense distribution
Of the 240 tokens of herinneren, 235 kept their original majority senses, 5 were corrected to another original sense, and none were removed.
Table 35 shows in how many tokens with each majority sense which actions were taken, and Figure 49 illustrates the frequency of the final tags. Figure 50 correlates the original majority sense and the final senses.
Figure 49. Final distribution of senses of ‘herinneren’.
| original | correct | majority |
|---|---|---|
| herinneren_1 | 0 | 75 |
| herinneren_2 | 2 | 159 |
| herinneren_3 | 1 | 1 |
| no_agreement | 2 | 0 |
Figure 50. Majority and final senses of ‘herinneren’.
Reliable cues
Table 36 shows the most frequent context words selected by the annotators as relevant. Table 37 and Table 38 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags herinneren_1 and herinneren_2.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 10 have no cues that match these criteria. 164 have one single cue and 66 have more than one (up to 6).
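The selection criterion above can be written as a small filter. This is a minimal sketch under stated assumptions: each annotator's choice is represented as a pair of sense tag and set of cue positions, and `reliable_cues` is a hypothetical name, not the function used to build the tables.

```python
from collections import Counter


def reliable_cues(selections, final_sense, min_annotators=2):
    """Cues chosen by at least `min_annotators` annotators with the final sense.

    `selections` holds one (sense_tag, cue_positions) pair per annotator.
    """
    votes = Counter()
    for sense, cues in selections:
        if sense == final_sense:     # ignore annotators who chose another sense
            votes.update(set(cues))  # one vote per annotator and cue
    return sorted(cue for cue, n in votes.items() if n >= min_annotators)


# Invented example; cues are identified by position labels as in the tables.
token = [
    ("herinneren_1", {"R1", "R2"}),
    ("herinneren_1", {"R1"}),
    ("herinneren_2", {"L1", "R1"}),
]
print(reliable_cues(token, "herinneren_1"))  # ['R1']
```

A token for which this filter returns an empty list would fall among the tokens with no cues that match the criteria.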
Across senses
The cues that distinguish these readings are mostly function words: aan and eraan for herinneren_1 and reflexive pronouns for herinneren_2, which is to be expected given that they are defined by such structures.
| Rank | herinneren_1 | n | herinneren_2 | n1 | herinneren_3 | n2 |
|---|---|---|---|---|---|---|
| 1 | aan/prep | 58 | zich/pron | 94 | word/verb | 1 |
| 2 | eraan/pp | 11 | me/pron | 47 | 0 | |
| 3 | word/verb | 3 | ik/pron | 18 | 0 | |
| 4 | de/det | 2 | mij/pron | 8 | 0 | |
| 5 | er/noun | 2 | kan/verb | 5 | 0 | |
| 6 | me/pron | 2 | een/det | 3 | 0 | |
| 7 | te/comp | 2 | hij/pron | 3 | 0 | |
| 8 | waaraan/pp | 2 | je/pron | 3 | 0 | |
| 9 | bewindsman/noun | 1 | dat/comp | 2 | 0 | |
| 10 | bij/prep | 1 | goed/adj | 2 | 0 | |
herinneren_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest two slots to either side of the target, one step away in the dependency path, and mainly as prepositional complement (#T->pc:CW, filled in mostly by aan but also eraan and daaraan).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | aan/prep | 58 | R1 | 36 | #T->pc:CW | 63 | 1 | 85 |
| 2 | eraan/pp | 11 | L1 | 15 | #T->mod:CW | 8 | 2 | 10 |
| 3 | word/verb | 3 | R2 | 13 | #T->su:CW | 5 | 3 | 6 |
| 4 | de/det | 2 | L2 | 10 | #T->obj1:CW | 4 | 4 | 3 |
| 5 | er/noun | 2 | L5 | 5 | CW->vc:#T | 3 | 0 | |
| 6 | me/pron | 2 | L3 | 4 | #T->pc:aan->obj1:CW | 2 | 0 | |
| 7 | te/comp | 2 | L4 | 4 | CW->body:#T | 2 | 0 | |
| 8 | waaraan/pp | 2 | R3 | 4 | ->[ROOT:#T,ROOT:procedure->dp:mij->mod:CW] | 1 | 0 | |
| 9 | bewindsman/noun | 1 | L6 | 3 | ->ROOT:sla_terug->[dp:#T,ROOT:CW] | 1 | 0 | |
| 10 | bij/prep | 1 | R4 | 3 | #T->mod:aan->obj1:CW | 1 | 0 | |
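Each pair of columns in the table above appears to be a separate top-10 ranking, so a row does not describe a single cue. The sketch below illustrates that reading with `collections.Counter`. The cue records are invented, and `rank_attributes` is a hypothetical helper, not the code behind the tables.

```python
from collections import Counter


def rank_attributes(cues, attributes, top=10):
    """Build one frequency ranking per attribute, independent of the others."""
    return {
        attr: Counter(cue[attr] for cue in cues).most_common(top)
        for attr in attributes
    }


# Invented cue records that reuse labels from the table above.
cues = [
    {"cw_type": "aan/prep", "position": "R1", "path": "#T->pc:CW", "steps": 1},
    {"cw_type": "aan/prep", "position": "R2", "path": "#T->pc:CW", "steps": 1},
    {"cw_type": "eraan/pp", "position": "L1", "path": "#T->pc:CW", "steps": 1},
]
rankings = rank_attributes(cues, ["cw_type", "position", "path", "steps"])
for attr, ranking in rankings.items():
    print(attr, ranking)
```

Under this reading, an attribute with fewer than ten distinct values produces a shorter ranking, which would explain the trailing cells without a count in the steps columns.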
herinneren_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the first slot to the right of the target, but also up to three slots away, one step away in the dependency path, and mainly as reflexive complement (#T->se:CW), although the subject and some adverbial complements were also selected.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | zich/pron | 94 | R1 | 68 | #T->se:CW | 148 | 1 | 206 |
| 2 | me/pron | 47 | R2 | 34 | #T->su:CW | 30 | 2 | 22 |
| 3 | ik/pron | 18 | L1 | 28 | #T->mod:CW | 10 | 3 | 3 |
| 4 | mij/pron | 8 | L2 | 24 | #T->obj1:CW | 8 | 4 | 1 |
| 5 | kan/verb | 5 | L3 | 15 | kan->[vc:#T,su:CW] | 6 | 0 | |
| 6 | een/det | 3 | L4 | 12 | CW->vc:#T | 5 | 0 | |
| 7 | hij/pron | 3 | R3 | 12 | CW->body:#T | 3 | 0 | |
| 8 | je/pron | 3 | L5 | 9 | ->[ROOT:#T,ROOT:CW] | 2 | 0 | |
| 9 | dat/comp | 2 | L6 | 7 | #T->vc:CW | 2 | 0 | |
| 10 | goed/adj | 2 | L8 | 6 | hoe->[dp:#T,dp:CW] | 2 | 0 | |
Most frequent dependency paths
Figure 51 shows the most frequent dependency paths colored by sense tag. The subject (#T->su:CW) and reflexive complement (#T->se:CW) are clearly preferred by herinneren_2, while the prepositional complement and its derivations (#T->pc:CW, #T->pc:X->obj1:CW, etc) go with herinneren_1.
Figure 51. Tokens per path.
Tracking lists
Only one list was compiled, with one element: (21), which semantically matches the first sense but lacks the preposition and, as it turns out, has a personal pronoun as direct object that was parsed as a reflexive complement.
- Ik gebruikte ze in Manchester en bracht ze mee naar mijn woonplaats Biarritz om me blijvend te herinneren dat ik een valsspeler ben . " " Op Millars vraag schreef de
Removed tokens
No token of herinneren will be removed.
HERHALEN
Original senses and annotations
The tokens of herhalen were annotated with 3 senses in 8 batches; the tags in Table 39 were suggested.
| Definitions |
|---|
| herhalen_1 |
| (trans.) m.b.t. handelingen of activiteiten: opnieuw uitvoeren: een experiment, een les, een bezoek herhalen |
| (trans.) w.r.t. acts or activities: perform again: repeat an experiment, a lesson, a visit |
| herhalen_2 |
| (trans.) m.b.t. zinnen, boodschappen e.d.: opnieuw uitspreken: kunt u dat even herhalen? |
| (trans.) w.r.t. utterances, messages and such: pronounce again: Could you please repeat that? |
| herhalen_3 |
| (reflex.) zich opnieuw voordoen: de geschiedenis herhaalt zich |
| (reflex.) occur again: history repeats itself |
Figure 52 shows the sense distribution by annotator and batch and Figure 53, that of the disagreements. Figure 54 shows the sense tags that each annotator of each batch assigned to the tokens with herhalen_1 as majority sense, Figure 55 that for herhalen_2 and Figure 56 that for herhalen_3.
General distribution
The sense distribution is relatively stable across and within batches: herhalen_2 is the most frequent reading, and herhalen_3 is quite infrequent. Almost all annotators disagree at some point with their colleagues, on any sense.
9 tokens had no agreement and two had not_listed as majority sense; they could be assigned herhalen_1 or the new herhalen_4, or were removed.
Figure 52. Distribution of senses of ‘herhalen’ per annotator and batch.
Figure 53. Distribution of disagreeing annotations of ‘herhalen’ per annotator and batch.
Disagreement in herhalen_1
In almost all batches there are a couple of disagreements with herhalen_2, but what jumps out the most are the not_listed annotations of the first annotator of batch 5. Most of these correspond to a new herhalen_4 ‘broadcast again’ sense.
Figure 54. Sense annotations of tokens with ‘herhalen_1’ as majority sense.
Final senses
One definition (herhalen_4) was added, based on the actual occurrences of the corpus and the annotators’ suggestions, so that the final senses are the ones in Table 40.
| code | Definition |
|---|---|
| herhalen_1 | (trans.) w.r.t. acts or activities: perform again |
| herhalen_2 | (trans.) w.r.t. utterances, messages and such: pronounce again |
| herhalen_3 | (reflex.) occur again |
| herhalen_4 | (trans.) of a show or an episode, broadcast again |
Original versus final sense distribution
Of the 320 tokens of herhalen, 275 kept their original majority senses, 11 were corrected to another original sense, and 7 were removed. 27 tokens were assigned a new sense.
Table 41 shows in how many tokens with each majority sense which actions were taken, and Figure 57 illustrates the frequency of the final tags. Figure 58 correlates the original majority sense and the final senses.
Figure 57. Final distribution of senses of ‘herhalen’.
| original | correct | majority | new | remove |
|---|---|---|---|---|
| herhalen_1 | 1 | 80 | 20 | 3 |
| herhalen_2 | 4 | 159 | 2 | 2 |
| herhalen_3 | 2 | 36 | 0 | 0 |
| no_agreement | 4 | 0 | 3 | 2 |
| not_listed | 0 | 0 | 2 | 0 |
Figure 58. Majority and final senses of ‘herhalen’.
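The totals stated above can be recomputed from Table 41 by summing each action column. The sketch below copies the table values into a dictionary; the variable names are illustrative only.

```python
# Rows of Table 41: original majority sense -> number of tokens per action.
table_41 = {
    "herhalen_1":   {"correct": 1, "majority": 80,  "new": 20, "remove": 3},
    "herhalen_2":   {"correct": 4, "majority": 159, "new": 2,  "remove": 2},
    "herhalen_3":   {"correct": 2, "majority": 36,  "new": 0,  "remove": 0},
    "no_agreement": {"correct": 4, "majority": 0,   "new": 3,  "remove": 2},
    "not_listed":   {"correct": 0, "majority": 0,   "new": 2,  "remove": 0},
}

# Sum every action column across the rows.
totals = {}
for actions in table_41.values():
    for action, n in actions.items():
        totals[action] = totals.get(action, 0) + n

print(totals)                # {'correct': 11, 'majority': 275, 'new': 27, 'remove': 7}
print(sum(totals.values()))  # 320
```

The column sums match the 275 kept, 11 corrected, 27 new and 7 removed tokens, and they add up to the 320 tokens of herhalen.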
Reliable cues
Table 42 shows the most frequent context words selected by the annotators as relevant. Table 43, Table 44 and Table 45 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags herhalen_1, herhalen_2 and herhalen_3.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 320 tokens, 52 have no cues that match these criteria. 127 have one single cue and 141 have more than one (up to 7).
Across senses
The strongest profile is that of herhalen_3, which strongly correlates with the reflexive pronoun and its most frequent subject, geschiedenis. For herhalen_1, the passive construction and nouns designating actions/performances are relatively frequent, while for herhalen_2 the subordinating conjunction dat, pronouns and speech-related lexemes like woord and standpunt are typical cues.
| Rank | herhalen_1 | n | herhalen_2 | n1 | herhalen_3 | n2 | herhalen_4 | n3 |
|---|---|---|---|---|---|---|---|---|
| 1 | de/det | 5 | dat/comp | 25 | zich/pron | 30 | radio/noun | 1 |
| 2 | prestatie/noun | 5 | de/det | 8 | geschiedenis/noun | 14 | Teleac_serie/noun | 1 |
| 3 | word/verb | 5 | het/det | 8 | scenario/noun | 4 | zend_uit/verb | 1 |
| 4 | handeling/noun | 4 | hij/pron | 8 | de/det | 2 | 0 | |
| 5 | actie/noun | 3 | ik/pron | 8 | dat/det | 1 | 0 | |
| 6 | dit/det | 3 | woord/noun | 8 | discussie/noun | 1 | 0 | |
| 7 | experiment/noun | 3 | heb/verb | 7 | dit/det | 1 | 0 | |
| 8 | te/comp | 3 | standpunt/noun | 7 | doem_scenario/noun | 1 | 0 | |
| 9 | zijn/det | 3 | eerder/adj | 6 | drama/noun | 1 | 0 | |
| 10 | beweging/noun | 2 | zeg/verb | 6 | gebeurtenis/noun | 1 | 0 | |
herhalen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 6 slots to the left of the target but also the first two to the right, up to two steps away in the dependency path and mainly as direct object (#T->obj1:CW) but also verb of which the target is a complement (CW->vc:#T, mainly filled by worden) or passive subject (word->[vc:#T,su:CW]) of the target.
The 8 cues beyond the sentence belong to 6 tokens: in all cases, the theme (what is being repeated) is either elided or referred to by a pronoun within the sentence of the target, but can be extracted from neighboring sentences.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | de/det | 5 | L1 | 19 | #T->obj1:CW | 46 | 1 | 68 |
| 2 | prestatie/noun | 5 | L2 | 15 | CW->vc:#T | 8 | 2 | 33 |
| 3 | word/verb | 5 | L5 | 15 | NA | 8 | 3 | 12 |
| 4 | handeling/noun | 4 | L3 | 14 | #T->mod:CW | 5 | 4 | 9 |
| 5 | actie/noun | 3 | L6 | 10 | word->[vc:#T,su:CW] | 5 | NA | 8 |
| 6 | dit/det | 3 | L4 | 9 | #T->su:CW | 3 | 5 | 3 |
| 7 | experiment/noun | 3 | R2 | 9 | CW->body:#T | 3 | 6 | 2 |
| 8 | te/comp | 3 | R1 | 8 | #T->obj1:en->cnj:CW | 2 | 7 | 1 |
| 9 | zijn/det | 3 | L7 | 6 | #T->obj1:fout->det:CW | 2 | 9 | 1 |
| 10 | beweging/noun | 2 | L8 | 6 | #T->obj1:handeling->det:CW | 2 | 10 | 1 |
herhalen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest 3 slots to either side of the target, one or maybe two steps away and mainly as direct object (#T->obj1:CW) but also as verbal complement (#T->vc:CW, mainly filled by dat, but also wat) or subject (#T->su:CW) of the target.
The 7 cues beyond the sentence belong to 5 sentences. In one of them, there is also another (sufficient) cue within the sentence; in two, the same cue occurs inside and outside the sentence and the latter was registered probably because of the known bug. In the other two, the object is a pronoun with a previous clause (of reported speech) as antecedent: the selected cues are part of the reported speech, but don’t help disambiguate beyond that particular relation.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | dat/comp | 25 | R1 | 57 | #T->obj1:CW | 66 | 1 | 159 |
| 2 | de/det | 8 | R2 | 40 | #T->vc:CW | 30 | 2 | 80 |
| 3 | het/det | 8 | L1 | 30 | #T->su:CW | 26 | 3 | 46 |
| 4 | hij/pron | 8 | R3 | 26 | #T->mod:CW | 19 | 4 | 19 |
| 5 | ik/pron | 8 | L2 | 22 | #T->mod:in->obj1:CW | 9 | NA | 7 |
| 6 | woord/noun | 8 | L3 | 21 | CW->vc:#T | 8 | 5 | 3 |
| 7 | heb/verb | 7 | L4 | 17 | #T->vc:wat->body:heb->vc:CW | 7 | 6 | 2 |
| 8 | standpunt/noun | 7 | L5 | 12 | word->[vc:#T,su:CW] | 7 | 0 | |
| 9 | eerder/adj | 6 | R4 | 11 | NA | 7 | 0 | |
| 10 | zeg/verb | 6 | R6 | 10 | heb->[vc:#T,su:CW] | 5 | 0 | |
herhalen_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest slot to either side of the target, one or maybe two steps away in the dependency path and mainly as reflexive complement (#T->se:CW) but also as subject of the target (#T->su:CW).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | zich/pron | 30 | R1 | 17 | #T->se:CW | 30 | 1 | 49 |
| 2 | geschiedenis/noun | 14 | L1 | 10 | #T->su:CW | 17 | 2 | 15 |
| 3 | scenario/noun | 4 | L2 | 9 | zal->[vc:#T,su:CW] | 3 | 3 | 5 |
| 4 | de/det | 2 | L3 | 7 | #T->su:geschiedenis->det:CW | 2 | 4 | 1 |
| 5 | dat/det | 1 | L4 | 6 | #T->su:scenario->det:CW | 2 | 5 | 1 |
| 6 | discussie/noun | 1 | R2 | 6 | lijk->vc:te->[body:#T,su:CW] | 2 | 0 | |
| 7 | dit/det | 1 | R3 | 5 | mag->[vc:#T,su:CW] | 2 | 0 | |
| 8 | doem_scenario/noun | 1 | L5 | 3 | #T->mod:CW | 1 | 0 | |
| 9 | drama/noun | 1 | L6 | 2 | #T->su:drama->det:CW | 1 | 0 | |
| 10 | gebeurtenis/noun | 1 | L7 | 2 | #T->su:geschiedenis->mod:CW | 1 | 0 | |
Most frequent dependency paths
Figure 59 shows the most frequent dependency paths colored by sense tag. The reflexive complement (#T->se:CW) is clearly linked to herhalen_3 and the subject seems to be more frequent with herhalen_2 than with herhalen_1 and herhalen_4, which tend to have modifiers and be used as verbal complement.
Figure 59. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (1 token, from herhalen_1);
- garden-path tokens ((22), of herhalen_1, where geschiedenis is the object of a transitive construction with a different subject instead of the subject of a reflexive one);
- atypical contexts: (23), of herhalen_1, where the object is missing, and (24), which is in verse;
- headlines (3 tokens, from herhalen_1 and herhalen_2);
- tokens with zich in a non-reflexive construction: in 5 tokens of herhalen_1, the object is a reflexive pronoun, and what is being repeated is someone’s artistic performance, with the added nuance of lack of creativity.
- in Amersfoort . ’ Wie de geschiedenis niet kent is gedwongen haar te herhalen ’ , is de algemene wijsheid . De IRA en de protestanten ,
- goed dat het cultuurseizoen op zijn eind loopt . De kunstkletsprogramma’s zwijgen of herhalen , de laatste prijzen zijn uitgereikt , de recensenten gaan op vakantie .
- ’ Nu ik op dit filmpje van taal / het gebeuren voor je herhaal , ’ schreef hij al in 1968 in ’ Landschap voor een dode meneer ’ .
Removed tokens
7 tokens will be removed: one because it is a duplicate of another token, and the rest because there is not enough context to distinguish between herhalen_1 and herhalen_2.
HELPEN
Original senses and annotations
The tokens of helpen were annotated with 3 senses in 6 batches; the tags in Table 46 were suggested.
| Definitions |
|---|
| helpen_1 |
| (trans.) ondersteunen in materiële of morele zin, bijstaan: met raad en daad helpen, een helpende hand, uit de nood helpen |
| (trans.) support in material or moral sense, assist: help in word and deed, a helping hand, help out |
| helpen_2 |
| (trans.) iem. assisteren door met hem samen te werken: helpen met het huiswerk; heb je dat alleen gedaan of heeft iemand je geholpen? |
| (trans.) assist someone by collaborating with them: help with homework, did you do that by yourself or did someone help you? |
| helpen_3 |
| (intrans.) voordeel opleveren, nuttig zijn: dat drankje heeft goed geholpen |
| (intrans.) yield advantage, be useful: that drink helped a lot |
Figure 60 shows the sense distribution by annotator and batch and Figure 61, that of the disagreements. Figure 62 shows the sense tags that each annotator of each batch assigned to the tokens with helpen_1 as majority sense, Figure 63 those for helpen_2 and Figure 64 for helpen_3.
General distribution
The sense distribution both across and within batches is quite unstable, with helpen_1 roughly the most frequent and helpen_2 the least frequent. Every annotator disagrees in about 25% of their annotations, mostly in tokens with helpen_1 or helpen_3 as majority sense.
15 tokens had no agreement, but all but one (which was removed) could be assigned a tag. The 10 tokens with not_listed as majority sense were either assigned a new tag or removed, and the one with wrong_lemma as majority sense was removed.
Figure 60. Distribution of senses of ‘helpen’ per annotator and batch.
Figure 61. Distribution of disagreeing annotations of ‘helpen’ per annotator and batch.
Disagreement in helpen_1
This sense covers about 30%-60% of each batch, with a number of cases of helpen_2 as alternative, or for annotator 1 of batch 3, not_listed.
Figure 62. Sense annotations of tokens with ‘helpen_1’ as majority sense.
Final senses
After the annotation, the definitions of helpen changed: helpen_4 and helpen_5 were added to gather an intransitive construction similar to helpen_1 but with inanimate entities and a construction with aan meaning “to provide”, respectively. While 21 tokens (mostly of helpen_1) instantiate a resultative construction with a preposition or adverb, the aan case warrants its own category because of its frequency and the tendency of the annotators to suggest a separate sense for them. The final definitions are shown in Table 47.
| code | Definition |
|---|---|
| helpen_1 | (trans.) support in material or moral sense, assist |
| helpen_2 | (trans.) assist someone by collaborating with them |
| helpen_3 | (intrans.) yield advantage, be useful |
| helpen_4 | (trans.) with inanimate entities, be helpful, useful |
| helpen_5 | (with aan) provide |
In addition, one idiom was identified that cannot be subsumed under any of the other senses, namely “om zeep helpen” ‘to kill’. There are 7 tokens belonging to this category.
Original versus final sense distribution
Of the 240 tokens of helpen, 143 kept their original majority senses, 39 were corrected to another original sense, and 7 were removed. 38 tokens were assigned a new sense; 13 tokens were identified as instances of some idiomatic expression.
Table 48 shows in how many tokens with each majority sense which actions were taken, and Figure 65 illustrates the frequency of the final tags. Figure 66 correlates the original majority sense and the final senses.
Figure 65. Final distribution of senses of ‘helpen’.
| original | correct | idiom | majority | new | remove |
|---|---|---|---|---|---|
| helpen_1 | 14 | 1 | 60 | 17 | 4 |
| helpen_2 | 7 | 0 | 43 | 4 | 1 |
| helpen_3 | 8 | 5 | 40 | 10 | 0 |
| no_agreement | 10 | 0 | 0 | 4 | 1 |
| not_listed | 0 | 7 | 0 | 3 | 0 |
| wrong_lemma | 0 | 0 | 0 | 0 | 1 |
Figure 66. Majority and final senses of ‘helpen’.
Reliable cues
Table 49 shows the most frequent context words selected by the annotators as relevant. Table 50, Table 51 and Table 52 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags helpen_1, helpen_2 and helpen_3.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 100 have no cues that match these criteria. 50 have one single cue and 90 have more than one (up to 10).
Across senses
The most frequent cues for these senses are not very frequent: in helpen_1, the te complement stands out, while het helpt (niet) seems to be quite typical for helpen_3.
| Rank | helpen_1 | n | helpen_2 | n1 | helpen_3 | n2 | helpen_5 | n3 | om zeep helpen | n4 | remove | n5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | te/comp | 10 | bij/prep | 4 | het/det | 11 | aan/prep | 3 | om/fixed | 2 | !/punct | 1 |
| 2 | een/det | 5 | hem/pron | 3 | niet/adv | 8 | een/det | 2 | zeep/fixed | 2 | de/det | 1 |
| 3 | om/comp | 5 | te/comp | 3 | zal/verb | 5 | bruid/noun | 1 | om/adj | 1 | het/noun | 1 |
| 4 | ons/pron | 5 | commissaris/noun | 2 | dat/det | 4 | goed/adj | 1 | zeep/noun | 1 | kan/verb | 1 |
| 5 | de/det | 4 | deze/det | 2 | niets/noun | 3 | hij/pron | 1 | 0 | | niet/adv | 1 |
| 6 | mens/noun | 4 | Europa/name | 2 | alleen/adv | 2 | kaart/noun | 1 | 0 | | uit/prep | 1 |
| 7 | ik/pron | 3 | het/det | 2 | bij/prep | 2 | rood/adj | 1 | 0 | | wereld/noun | 1 |
| 8 | met/prep | 3 | ik/pron | 2 | de/det | 2 | 0 | | 0 | | 0 | |
| 9 | bovenop/prep | 2 | met/prep | 2 | aanpak/noun | 1 | 0 | | 0 | | 0 | |
| 10 | familie_lid/noun | 2 | moet/verb | 2 | aanschaf/noun | 1 | 0 | | 0 | | 0 | |
helpen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest four slots to the left of the target, one or maybe two steps away in the dependency path, mainly as direct object (#T->obj1:CW) but also complementizer (CW->body:#T, mostly filled by te).
The five cues beyond the sentence belong to two tokens: in one case, the context words inside the sentence are indeed not that informative, but in the other one there are also enough cues within the sentence.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | te/comp | 10 | L2 | 29 | #T->obj1:CW | 25 | 1 | 62 |
| 2 | een/det | 5 | L1 | 21 | CW->body:#T | 10 | 2 | 39 |
| 3 | om/comp | 5 | L3 | 16 | #T->mod:CW | 8 | 3 | 18 |
| 4 | ons/pron | 5 | L4 | 14 | NA | 5 | 4 | 14 |
| 5 | de/det | 4 | R1 | 7 | #T->ld:CW | 4 | 5 | 7 |
| 6 | mens/noun | 4 | R2 | 7 | #T->pc:CW | 4 | NA | 5 |
| 7 | ik/pron | 3 | L5 | 6 | #T->su:CW | 4 | 6 | 2 |
| 8 | met/prep | 3 | R3 | 6 | CW->body:te->body:#T | 4 | 7 | 2 |
| 9 | bovenop/prep | 2 | L10 | 4 | #T->obj1:en->cnj:CW | 3 | 14 | 1 |
| 10 | familie_lid/noun | 2 | L13 | 4 | #T->pc:met->obj1:CW | 3 | 0 | |
helpen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest slot to either side of the target, one or maybe two steps away in the dependency path, mainly as direct object (#T->obj1:CW) or verb complement (#T->vc:CW, te in “helpen te bevrijden”, bevrijden in “helpen bevrijden”).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | bij/prep | 4 | R1 | 20 | #T->obj1:CW | 16 | 1 | 50 |
| 2 | hem/pron | 3 | L1 | 12 | #T->vc:CW | 13 | 2 | 28 |
| 3 | te/comp | 3 | L2 | 10 | #T->su:CW | 9 | 3 | 14 |
| 4 | commissaris/noun | 2 | R3 | 9 | #T->pc:CW | 6 | 4 | 9 |
| 5 | deze/det | 2 | L3 | 6 | #T->pc:bij->obj1:CW | 5 | 5 | 3 |
| 6 | Europa/name | 2 | R2 | 6 | CW->vc:#T | 5 | 0 | |
| 7 | het/det | 2 | R5 | 6 | #T->vc:te->body:CW | 3 | 0 | |
| 8 | ik/pron | 2 | R4 | 5 | moet->[vc:#T,su:CW] | 2 | 0 | |
| 9 | met/prep | 2 | L5 | 4 | #T->ld:in->obj1:CW | 1 | 0 | |
| 10 | moet/verb | 2 | L6 | 4 | #T->ld:in->obj1:eenmanszaak->mod:van->obj1:CW | 1 | 0 | |
helpen_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the first slot to either side of the target, one or maybe two steps away in the dependency path, mainly as subject (#T->su:CW) or modifier (#T->mod:CW, mostly niet) of the target. The one context word outside the sentence occurs both inside and outside.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | het/det | 11 | L1 | 24 | #T->su:CW | 19 | 1 | 50 |
| 2 | niet/adv | 8 | R1 | 13 | #T->mod:CW | 14 | 2 | 14 |
| 3 | zal/verb | 5 | L3 | 10 | #T->obj1:CW | 6 | 3 | 9 |
| 4 | dat/det | 4 | L2 | 7 | CW->vc:#T | 5 | 4 | 8 |
| 5 | niets/noun | 3 | L4 | 6 | #T->pc:CW | 2 | 5 | 2 |
| 6 | alleen/adv | 2 | L5 | 5 | #T->su:Miracle->mod:samengesteld->pc:uit->obj1:en->cnj:CW | 2 | NA | 1 |
| 7 | bij/prep | 2 | R2 | 5 | heb->[vc:#T,su:of->cnj:aanpak->mod:CW] | 2 | 0 | |
| 8 | de/det | 2 | R3 | 5 | zal->[vc:#T,su:CW] | 2 | 0 | |
| 9 | aanpak/noun | 1 | L8 | 2 | ->ROOT:op->dp:zal->[vc:#T,ROOT:CW] | 1 | 0 | |
| 10 | aanschaf/noun | 1 | R4 | 2 | #T->ld:CW | 1 | 0 | |
Most frequent dependency paths
Figure 67 shows the most frequent dependency paths colored by sense tag. The top two paths are almost exclusive to om zeep helpen and indicate a fixed expression; except for the subject, which is frequent for helpen_3, all of these paths are quite frequent in tokens of helpen_5; the direct object is mostly present in helpen_1 and helpen_4, but also in helpen_2, and modifiers are fairly frequent as well.
Figure 67. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- headlines (8 tokens, mostly from helpen_1);
- special collocation (3 tokens of helpen_2 with een handje);
- resultative construction, with a preposition or adverb, such as vooruit helpen, bovenop helpen, uit iets helpen (21 tokens, mostly of helpen_1 but also helpen_2 and helpen_4).
Removed tokens
7 tokens will be removed because they instantiate very infrequent senses or idiomatic expressions or, in two cases, because they could equally refer to helpen_1 or helpen_2. The latter cases could be included in some models to see if they are modelled in an intermediate position, but since they are very rare, this might not be worth pursuing.
HARDEN
Original senses and annotations
The tokens of harden were annotated with 5 senses in 8 batches; the tags in Table 53 were suggested.
| Definitions |
|---|
| harden_1 |
| (trans.) hard maken, in letterlijke zin: staal harden |
| (trans.) make hard, in literal sense: harden steel |
| harden_2 |
| (intrans.) hard worden, in letterlijke zin: snel hardende verven |
| (intrans.) become hard, in literal sense: quickly hardening paint |
| harden_3 |
| (trans.) hard maken in figuurlijke zin; weerstand en veerkracht bijbrengen: een kind harden tegen het klimaat |
| (trans.) make hard in figurative sense; impart resistance and resilience: toughen a child against the weather |
| harden_4 |
| (reflex.) bij zichzelf weerstand en veerkracht aankweken: zich harden tegen het lot |
| (reflex.) develop resistance and resilience by oneself: toughen oneself against fate |
| harden_5 |
| (trans.) uithouden, verdragen: niet te harden |
| (trans.) endure, tolerate: unbearable (‘not to bear’) |
Figure 68 shows the sense distribution by annotator and batch and Figure 69, that of the disagreements. Figure 70 shows the sense tags that each annotator of each batch assigned to the tokens with harden_2 as majority sense, Figure 71 those for harden_3, Figure 72 for harden_4 and Figure 73 for harden_5. harden_1 is too infrequent to require a plot.
General distribution
The fifth sense is by far the most frequent in all the batches, followed by harden_3. The rest of the senses are quite infrequent; even the wrong_lemma tag is more frequent than them in some cases. That said, there is little disagreement, focused on tokens with harden_3 or wrong_lemma as majority sense.
There is only one token with no agreement, which is an instance of the adjective hard, and 34 (10.62% of the tokens) with wrong_lemma as majority sense, which are instances of surnames or hard as an adjective or adverb (Table 54).
Figure 68. Distribution of senses of ‘harden’ per annotator and batch.
Figure 69. Distribution of disagreeing annotations of ‘harden’ per annotator and batch.
Disagreement in harden_1
There are only four tokens with harden_1 as majority sense: one in batch 1 has full agreement, but the other three, in batches 1, 2 and 5, have harden_2 as alternative.
Disagreement in harden_2
There are 0 to 3 tokens per batch of this sense, 5 of which have harden_1 as alternative. 5 of them are actually instances of uitharden.
Figure 70. Sense annotations of tokens with ‘harden_2’ as majority sense.
Disagreement in harden_3
This sense covers about 10%-30% of each batch, with quite some disagreement. In many cases harden_4 is an alternative annotation, but sometimes other tags as well.
Figure 71. Sense annotations of tokens with ‘harden_3’ as majority sense.
Final senses
The final definitions are the same as the original definitions: no (sub)senses were added or modified.
Original versus final sense distribution
Of the 320 tokens of harden, 275 kept their original majority senses, 4 were corrected to another original sense, and 41 were removed.
Table 55 shows in how many tokens with each majority sense which actions were taken, and Figure 74 illustrates the frequency of the final tags. Figure 75 correlates the original majority sense and the final senses.
Figure 74. Final distribution of senses of ‘harden’.
| original | correct | majority | remove |
|---|---|---|---|
| harden_1 | 1 | 3 | 0 |
| harden_2 | 0 | 9 | 5 |
| harden_3 | 1 | 63 | 1 |
| harden_4 | 2 | 9 | 0 |
| harden_5 | 0 | 191 | 0 |
| no_agreement | 0 | 0 | 1 |
| wrong_lemma | 0 | 0 | 34 |
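The totals quoted in the text can be recovered from the table above by summing each action column. The following is a minimal sketch, with the counts transcribed by hand from the table; the names `harden_actions` and `action_totals` are illustrative and not part of the original workflow.

```python
# Counts transcribed from the table above: original majority sense -> action taken.
harden_actions = {
    "harden_1": {"correct": 1, "majority": 3, "remove": 0},
    "harden_2": {"correct": 0, "majority": 9, "remove": 5},
    "harden_3": {"correct": 1, "majority": 63, "remove": 1},
    "harden_4": {"correct": 2, "majority": 9, "remove": 0},
    "harden_5": {"correct": 0, "majority": 191, "remove": 0},
    "no_agreement": {"correct": 0, "majority": 0, "remove": 1},
    "wrong_lemma": {"correct": 0, "majority": 0, "remove": 34},
}


def action_totals(table):
    """Sum the number of tokens per action across all majority senses."""
    totals = {}
    for row in table.values():
        for action, n in row.items():
            totals[action] = totals.get(action, 0) + n
    return totals


harden_totals = action_totals(harden_actions)
harden_n_tokens = sum(harden_totals.values())
print(harden_totals, harden_n_tokens)
```

The column sums should match the 275 kept, 4 corrected and 41 removed tokens reported for the 320 tokens of harden.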
Figure 75. Majority and final senses of ‘harden’.
Reliable cues
Table 56 shows the most frequent context words selected by the annotators as relevant. Table 57, Table 58 and Table 59 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags harden_3, harden_4 and harden_5. harden_1 and harden_2 won’t be shown because they are too infrequent: the highest ranked cue based on any attribute has a frequency of 4.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 320 tokens, 23 have no cues that match these criteria. 64 have one single cue and 233 have more than one (up to 9).
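The selection criterion can be sketched as follows. This is one possible reading, assuming that a context word counts as a reliable cue when at least two of the annotators who assigned the final sense selected it; the function name, the data layout and the example token are hypothetical.

```python
from collections import Counter


def reliable_cues(annotations, final_sense, min_annotators=2):
    """Return the context words chosen by at least `min_annotators`
    annotators whose sense tag equals the final sense.

    `annotations` is a list of (assigned_sense, selected_context_words),
    one pair per annotator of the token (hypothetical layout).
    """
    counts = Counter()
    for sense, cues in annotations:
        if sense == final_sense:
            counts.update(set(cues))  # each annotator counts once per word
    return {cw for cw, n in counts.items() if n >= min_annotators}


# Hypothetical token annotated by three people.
token_annotations = [
    ("harden_5", ["niet", "te", "pijn"]),
    ("harden_5", ["te", "niet"]),
    ("harden_3", ["pijn"]),
]
cues = reliable_cues(token_annotations, final_sense="harden_5")
print(sorted(cues))
```

In this invented example pijn is dropped: only one of the two annotators who assigned the final sense selected it, and the third annotator disagreed on the sense.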
Across senses
Most of the senses are too infrequent to have stable frequent cues: for harden_5, te and niet are of course the main cues, but also frequent themes (things that cannot be tolerated), such as pijn and stank. The reflexive pronoun is a relatively frequent cue for the reflexive reading, harden_4, and zijn seems relatively frequent for harden_3.
| Rank | harden_1 | n | harden_2 | n1 | harden_3 | n2 | harden_4 | n3 | harden_5 | n4 | remove | n5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | draai/verb | 1 | laat/verb | 2 | ben/verb | 10 | zich/pron | 7 | te/comp | 171 | lab_euro/noun | 4 |
| 2 | gebruik/verb | 1 | beton/noun | 1 | door/prep | 9 | ge/pron | 1 | niet/adv | 153 | werk/verb | 3 |
| 3 | hand_vat/noun | 1 | droog/verb | 1 | heb/verb | 6 | hij/pron | 1 | pijn/noun | 41 | gewerkt/adj | 2 |
| 4 | het/det | 1 | gips/noun | 1 | hij/pron | 4 | pantser/noun | 1 | stank/noun | 35 | Amerikaans/adj | 1 |
| 5 | huid/noun | 1 | golfplaat/noun | 1 | me/pron | 4 | uzelf/pron | 1 | meer/adv | 32 | ben/verb | 1 |
| 6 | oppervlak/noun | 1 | grit/noun | 1 | mentaal/adj | 4 | zal/verb | 1 | hitte/noun | 9 | bewijs/noun | 1 |
| 7 | slijp/verb | 1 | kassei/noun | 1 | dat/det | 3 | zichzelf/pron | 1 | nauwelijks/adv | 9 | bezig/adj | 1 |
| 8 | voetzool/noun | 1 | lijm/noun | 1 | het/det | 3 | | 0 | lawaai/noun | 7 | d/noun | 1 |
| 9 | workshop/noun | 1 | plateau/noun | 1 | in/prep | 3 | | 0 | geur/noun | 5 | Dick/name | 1 |
| 10 | | 0 | rij_over/verb | 1 | leven/noun | 3 | | 0 | ben/verb | 4 | dreigend/adj | 1 |
harden_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to the left of the target and the first to the right, up to two steps away in the dependency path, mostly as modifier (#T->mod:CW), direct object (#T->obj1:CW) or verb of which the target is a complement (CW->vc:#T).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | ben/verb | 10 | L1 | 22 | #T->mod:CW | 19 | 1 | 55 |
| 2 | door/prep | 9 | L2 | 18 | #T->obj1:CW | 17 | 2 | 42 |
| 3 | heb/verb | 6 | L3 | 16 | CW->vc:#T | 15 | 3 | 12 |
| 4 | hij/pron | 4 | R1 | 12 | #T->mod:door->obj1:CW | 10 | 4 | 11 |
| 5 | me/pron | 4 | L4 | 10 | heb->[vc:#T,su:CW] | 8 | 6 | 2 |
| 6 | mentaal/adj | 4 | R3 | 10 | ben->[vc:#T,su:CW] | 6 | 7 | 2 |
| 7 | dat/det | 3 | L5 | 8 | #T->mod:in->obj1:CW | 4 | NA | 2 |
| 8 | het/det | 3 | R2 | 8 | #T->su:CW | 4 | 5 | 1 |
| 9 | in/prep | 3 | R4 | 7 | #T->mod:tegen->obj1:CW | 3 | | 0 |
| 10 | leven/noun | 3 | L10 | 3 | en->[cnj:#T,cnj:CW] | 3 | | 0 |
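The position labels in these tables (L1, L2, R1 and so on) can be read as the distance in tokens to the left or right of the target. A minimal sketch, assuming 0-based token indices in a whitespace-tokenised context; the function name is illustrative, and the fragment is taken from the tracked example with nationaal gehard.

```python
def position_label(cw_index, target_index):
    """Label a context word by its side (L/R) and distance from the target."""
    offset = cw_index - target_index
    if offset == 0:
        raise ValueError("the context word cannot be the target itself")
    side = "L" if offset < 0 else "R"
    return f"{side}{abs(offset)}"


tokens = "dat is alleen nationaal gehard".split()
target_index = tokens.index("gehard")
position_labels = {
    tok: position_label(i, target_index)
    for i, tok in enumerate(tokens)
    if i != target_index
}
print(position_labels)
```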
harden_4
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest two slots to either side of the target, one step away in the dependency path, mainly as direct object (#T->obj1:CW) of the target: the parser has not recognized the reflexive pronoun as a reflexive complement.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | zich/pron | 7 | L2 | 3 | #T->obj1:CW | 8 | 1 | 8 |
| 2 | ge/pron | 1 | R1 | 3 | CW->vc:en->cnj:#T | 1 | 2 | 2 |
| 3 | hij/pron | 1 | L1 | 2 | en->[cnj:#T,cnj:vorm->obj1:CW] | 1 | 3 | 2 |
| 4 | pantser/noun | 1 | L3 | 1 | moet->[vc:#T,su:CW] | 1 | NA | 1 |
| 5 | uzelf/pron | 1 | L4 | 1 | zal->vc:en->[cnj:#T,su:CW] | 1 | | 0 |
| 6 | zal/verb | 1 | L7 | 1 | NA | 1 | | 0 |
| 7 | zichzelf/pron | 1 | L8 | 1 | | 0 | | 0 |
| 8 | | 0 | R8 | 1 | | 0 | | 0 |
| 9 | | 0 | | 0 | | 0 | | 0 |
| 10 | | 0 | | 0 | | 0 | | 0 |
harden_5
Next to the list of types that were selected as cues, we can see that they mostly occur in the first two slots to the left of the target and one step away in the dependency path, mainly as modifier (#T->mod:CW, filled mostly by niet but also nauwelijks, amper…) or complementizer on which the target depends (CW->body:#T, filled by te). The path ben->vc:te->[body:#T,su:CW], which should actually be expressed as ben->[vc:te->body:#T,su:CW], links pijn, stank and other objects (which would be objects of harden but are subjects of zijn).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | te/comp | 171 | L1 | 173 | CW->body:#T | 159 | 1 | 339 |
| 2 | niet/adv | 153 | L2 | 145 | #T->mod:CW | 157 | 3 | 113 |
| 3 | pijn/noun | 41 | L3 | 85 | ben->vc:te->[body:#T,su:CW] | 86 | 2 | 58 |
| 4 | stank/noun | 35 | L4 | 47 | #T->mod:niet->mod:CW | 30 | 4 | 21 |
| 5 | meer/adv | 32 | L5 | 26 | #T->obj1:CW | 16 | NA | 11 |
| 6 | hitte/noun | 9 | L6 | 15 | NA | 11 | 5 | 7 |
| 7 | nauwelijks/adv | 9 | R1 | 13 | CW->mod:te->body:#T | 8 | 6 | 2 |
| 8 | lawaai/noun | 7 | L7 | 9 | ben->vc:te->[body:#T,su:en->cnj:CW] | 5 | 7 | 2 |
| 9 | geur/noun | 5 | L8 | 9 | CW->vc:te->body:#T | 5 | 8 | 1 |
| 10 | ben/verb | 4 | L9 | 8 | CW->vc:#T | 4 | 10 | 1 |
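The relation between the path and steps columns can be approximated by counting the labelled edges in a path string. The sketch below assumes that every edge is written as relation:node, so that the number of colons equals the number of steps. This assumption is consistent with the rows shown in these tables but is not confirmed as the original procedure, and the function name is hypothetical.

```python
def path_steps(path):
    """Count the labelled edges (relation:node pairs) in a dependency path.

    Returns None for missing paths, which the tables report as NA.
    Assumes node labels themselves never contain a colon.
    """
    if path is None or path == "NA":
        return None
    return path.count(":")


example_paths = [
    "CW->body:#T",
    "#T->mod:niet->mod:CW",
    "ben->vc:te->[body:#T,su:CW]",
    "NA",
]
step_counts = {p: path_steps(p) for p in example_paths}
print(step_counts)
```

Because it only counts edges, the sketch gives the same result for ben->vc:te->[body:#T,su:CW] and for the corrected notation ben->[vc:te->body:#T,su:CW].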
Most frequent dependency paths
Figure 76 shows the most frequent dependency paths colored by sense tag. The passive construction seems to be preferred by harden_3, while the complementizers on which the target depends (CW->body:#T) and their extensions are typical of harden_5. The rest of the senses are too infrequent.
Figure 76. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalization (2 tokens of harden_1 and harden_2);
- atypical context: (25) and (26), of harden_1 and harden_3 respectively. In the former, there seems to be some connection missing between workshops and the string of related verbs; in the latter, the collocation with nationaal is strange.
(25) Tijd voor een tweedaagse International Tool Conference met workshops scharen slijpen , harden , handvaten draaien voor beitels , fondswerving en samenwerking van werkplaatsen . Deelnemers -
(26) Op de bank zit weliswaar genoeg talent , maar dat is alleen nationaal gehard . Dat scheelt veel met de internationale maatstaf . ’ Daarom
Removed tokens
41 tokens will be removed because they are instances of a surname, uitharden or hard as adjective or adverb.
HERSTELLEN
Original senses and annotations
The tokens of herstellen were annotated with 5 senses in 6 batches; the tags in Table 60 were suggested.
| Definitions |
|---|
| herstellen_1 |
| (trans.) repareren, de eraan ontstane schade wegwerken: het dak herstellen |
| (trans.) repair, get rid of the damage in something: repair the roof |
| herstellen_2 |
| (trans.) tot de vorige toestand terugbrengen, doen terugkeren: de goede verstandhouding herstellen |
| (trans.) bring back, make return to the previous state: restore the good understanding |
| herstellen_3 |
| (trans.) goedmaken, weer doen vergeten: een fout herstellen |
| (trans.) make good, make forget: fix a mistake |
| herstellen_4 |
| (reflex.) tot de oorspronkelijke toestand terugkeren: de rust herstelt zich |
| (reflex.) return to the original state: peace is restored |
| herstellen_5 |
| (intrans.) genezen: van een ziekte herstellen |
| (intrans.) heal: heal from a disease |
Figure 77 shows the sense distribution by annotator and batch and Figure 78, that of the disagreements. Figure 79 shows the sense tags that each annotator of each batch assigned to the tokens with herstellen_1 as majority sense, Figure 80 those for herstellen_2, Figure 81 for herstellen_3, Figure 82 for herstellen_4 and Figure 83 for herstellen_5.
General distribution
The sense distribution varies slightly between batches, but is relatively stable between annotators of the same batch. herstellen_3 is constantly the least frequent, and herstellen_1 is much more frequent in the last two batches than in the other four, while herstellen_4 presents the opposite behaviour and herstellen_2 and herstellen_5 remain fairly frequent in all batches. There is some disagreement, mostly in tokens with herstellen_2 as majority sense and especially from annotator 2 of batch 3, who disagrees with the majority in half their annotations. This batch also has the largest number of tokens with no agreement, although there are some in all batches. All 12 tokens with no agreement could be assigned a sense, mostly herstellen_2.
Figure 77. Distribution of senses of ‘herstellen’ per annotator and batch.
Figure 78. Distribution of disagreeing annotations of ‘herstellen’ per annotator and batch.
Disagreement in herstellen_1
This sense covers up to 10% of the first four batches, almost with full agreement, and about 30% of the other two, with some occasional alternative annotations of almost any other sense (never of herstellen_4, the reflexive one).
Figure 79. Sense annotations of tokens with ‘herstellen_1’ as majority sense.
Disagreement in herstellen_2
This sense covers 20%-50% of each batch, but with a number of alternative annotations, mostly of herstellen_3.
Figure 80. Sense annotations of tokens with ‘herstellen_2’ as majority sense.
Disagreement in herstellen_3
This sense is attested in 1-3 tokens per batch, mostly with one of the other transitive readings as alternative.
Figure 81. Sense annotations of tokens with ‘herstellen_3’ as majority sense.
Final senses
One definition was added, based on the actual occurrences of the corpus, so that the final senses are the ones in Table 61. The question is open whether herstellen_6 is a figurative extension of herstellen_5, with financial entities as subjects instead of people, or an intransitive variation from herstellen_2, or the middle point where both meet. Only one annotator suggested this as a separate sense.
| code | Definition |
|---|---|
| herstellen_1 | (trans.) repair, get rid of the damage in something |
| herstellen_2 | (trans.) bring back, make return to the previous state |
| herstellen_3 | (trans.) make good, make forget |
| herstellen_4 | (reflex.) return to the original state |
| herstellen_5 | (intrans.) heal |
| herstellen_6 | (intrans.) of a financial/economic entity, recover |
Original versus final sense distribution
Of the 240 tokens of herstellen, 207 kept their original majority senses, 25 were corrected to another original sense, and 1 was removed. 7 tokens were assigned a new sense.
Table 62 shows in how many tokens with each majority sense which actions were taken, and Figure 84 illustrates the frequency of the final tags. Figure 85 correlates the original majority sense and the final senses.
Figure 84. Final distribution of senses of ‘herstellen’.
| original | correct | majority | new | remove |
|---|---|---|---|---|
| herstellen_1 | 7 | 32 | 0 | 0 |
| herstellen_2 | 1 | 77 | 6 | 0 |
| herstellen_3 | 2 | 9 | 0 | 0 |
| herstellen_4 | 0 | 34 | 1 | 0 |
| herstellen_5 | 3 | 55 | 0 | 1 |
| no_agreement | 12 | 0 | 0 | 0 |
Figure 85. Majority and final senses of ‘herstellen’.
Reliable cues
Table 63 shows the most frequent context words selected by the annotators as relevant. Table 64, Table 65, Table 66 and Table 67 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags herstellen_1, herstellen_2, herstellen_4 and herstellen_5. herstellen_3 won’t be shown because it is too infrequent: the highest ranked cue based on any attribute has a frequency of 4.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 240 tokens, 36 have no cues that match these criteria. 95 have one single cue and 109 have more than one (up to 5).
Across senses
The clearest profiles based on lemma-pos combination are those of herstellen_2, with in ere and evenwicht as frequent representative cues; the reflexive reading herstellen_4, with zich; and herstellen_5, with van, the preposition that introduces the damage or disease someone is healing from. The cues for herstellen_1 and herstellen_3 are quite infrequent.
| Rank | herstellen_1 | n | herstellen_2 | n1 | herstellen_3 | n2 | herstellen_4 | n3 | herstellen_5 | n4 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | het/det | 3 | ere/noun | 10 | dat/det | 2 | zich/pron | 31 | van/prep | 15 |
| 2 | de/det | 2 | evenwicht/noun | 10 | bilateraal/adj | 1 | de/det | 3 | een/det | 8 |
| 3 | electrisch/adj | 2 | in/prep | 10 | euvel/noun | 1 | ons/pron | 2 | ben/verb | 4 |
| 4 | en/vg | 2 | oorspronkelijk/adj | 5 | fout/noun | 1 | situatie/noun | 2 | blessure/noun | 4 |
| 5 | fiets/noun | 2 | orde/noun | 5 | fout_DIM/noun | 1 | te/comp | 2 | hij/pron | 4 |
| 6 | leiding/noun | 2 | contact/noun | 4 | kwaad/noun | 1 | daarna/pp | 1 | te/comp | 4 |
| 7 | word/verb | 2 | het/det | 4 | miskleun/noun | 1 | economie/noun | 1 | ziekte/noun | 3 |
| 8 | aanvezeling/noun | 1 | veiligheid/noun | 4 | misser/noun | 1 | fonds/noun | 1 | de/det | 2 |
| 9 | appartement/noun | 1 | vertrouwen/noun | 4 | onrecht/noun | 1 | golfer/noun | 1 | knie_blessure/noun | 2 |
| 10 | balk/noun | 1 | democratie/noun | 2 | probleem/noun | 1 | heb/verb | 1 | kwetsuur/noun | 2 |
herstellen_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the first two slots to the left of the target, up to two steps away in the dependency path, mainly as direct object (#T->obj1:CW) of the target. The 11 cues outside the sentence belong to 6 tokens; in one of them, the theme is a pronoun with the antecedent in the previous sentence, but in the rest, there are enough cues inside the sentence or in any case those outside don’t contribute that much.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | het/det | 3 | L2 | 10 | NA | 11 | 1 | 18 |
| 2 | de/det | 2 | L1 | 9 | #T->obj1:CW | 10 | 2 | 17 |
| 3 | electrisch/adj | 2 | L3 | 6 | #T->mod:CW | 4 | 3 | 12 |
| 4 | en/vg | 2 | R1 | 5 | word->[vc:#T,su:CW] | 4 | NA | 11 |
| 5 | fiets/noun | 2 | L4 | 4 | #T->obj1:en->cnj:CW | 3 | 4 | 5 |
| 6 | leiding/noun | 2 | L5 | 4 | #T->mod:van->obj1:CW | 2 | 6 | 2 |
| 7 | word/verb | 2 | R2 | 4 | #T->mod:van->obj1:en->cnj:CW | 2 | 5 | 1 |
| 8 | aanvezeling/noun | 1 | R3 | 4 | CW->vc:#T | 2 | 9 | 1 |
| 9 | appartement/noun | 1 | L9 | 3 | en->[cnj:#T,cnj:CW] | 2 | | 0 |
| 10 | balk/noun | 1 | R4 | 3 | ->[ROOT:#T,ROOT:ben->vc:begin->pc:met->obj1:CW] | 1 | | 0 |
herstellen_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to the target, one or maybe two steps away in the dependency path, mainly as direct object (#T->obj1:CW) of the target.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | ere/noun | 10 | L2 | 34 | #T->obj1:CW | 46 | 1 | 63 |
| 2 | evenwicht/noun | 10 | L1 | 19 | #T->pc:in->obj1:CW | 15 | 2 | 34 |
| 3 | in/prep | 10 | L3 | 16 | #T->pc:CW | 10 | 3 | 17 |
| 4 | oorspronkelijk/adj | 5 | L4 | 11 | CW->vc:#T | 4 | 4 | 6 |
| 5 | orde/noun | 5 | L5 | 9 | word->[vc:#T,su:CW] | 3 | 5 | 2 |
| 6 | contact/noun | 4 | R2 | 6 | #T->mod:van->obj1:CW | 2 | 6 | 1 |
| 7 | het/det | 4 | R3 | 6 | #T->obj1:evenwicht->det:CW | 2 | NA | 1 |
| 8 | veiligheid/noun | 4 | L9 | 5 | #T->pc:in->obj1:toestand->mod:CW | 2 | | 0 |
| 9 | vertrouwen/noun | 4 | L6 | 4 | ben->[vc:#T,su:CW] | 2 | | 0 |
| 10 | democratie/noun | 2 | R4 | 3 | word->[vc:#T,su:en->cnj:CW] | 2 | | 0 |
herstellen_4
Next to the list of types that were selected as cues, we can see that they mostly occur in the first slot to the right of the target, one step away in the dependency path, mainly as reflexive complement (#T->se:CW).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | zich/pron | 31 | R1 | 12 | #T->se:CW | 34 | 1 | 46 |
| 2 | de/det | 3 | L3 | 9 | #T->su:CW | 8 | 2 | 4 |
| 3 | ons/pron | 2 | L1 | 8 | CW->body:#T | 2 | 3 | 2 |
| 4 | situatie/noun | 2 | L2 | 7 | #T->mod:CW | 1 | | 0 |
| 5 | te/comp | 2 | R2 | 4 | #T->su:fonds->det:CW | 1 | | 0 |
| 6 | daarna/pp | 1 | L5 | 2 | #T->su:golfer->det:CW | 1 | | 0 |
| 7 | economie/noun | 1 | R3 | 2 | #T->su:ploeg->det:CW | 1 | | 0 |
| 8 | fonds/noun | 1 | L10 | 1 | #T->su:situatie->det:CW | 1 | | 0 |
| 9 | golfer/noun | 1 | L11 | 1 | ben->vc:aan->[body:#T,su:CW] | 1 | | 0 |
| 10 | heb/verb | 1 | L12 | 1 | CW->vc:#T | 1 | | 0 |
herstellen_5
Next to the list of types that were selected as cues, we can see that they mostly occur in the closest three slots to the right of the target, up to three steps away in the dependency path, mainly as modifier (#T->mod:CW, mostly filled by van) or object linked through van as either modifier or prepositional complement (#T->mod:van->obj1:CW, #T->pc:van->obj1:CW).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | van/prep | 15 | R3 | 20 | #T->mod:CW | 17 | 1 | 39 |
| 2 | een/det | 8 | R2 | 19 | #T->mod:van->obj1:CW | 13 | 2 | 37 |
| 3 | ben/verb | 4 | R1 | 16 | #T->pc:van->obj1:CW | 13 | 3 | 27 |
| 4 | blessure/noun | 4 | L1 | 13 | #T->pc:CW | 9 | 4 | 8 |
| 5 | hij/pron | 4 | R4 | 13 | #T->su:CW | 6 | NA | 4 |
| 6 | te/comp | 4 | L2 | 9 | #T->pc:van->obj1:en->cnj:CW | 4 | 5 | 3 |
| 7 | ziekte/noun | 3 | L3 | 6 | ben->[vc:#T,su:CW] | 4 | 6 | 2 |
| 8 | de/det | 2 | R6 | 6 | CW->vc:#T | 4 | 8 | 2 |
| 9 | knie_blessure/noun | 2 | L4 | 4 | NA | 4 | 9 | 1 |
| 10 | kwetsuur/noun | 2 | R5 | 4 | CW->body:#T | 3 | 10 | 1 |
Most frequent dependency paths
Figure 86 shows the most frequent dependency paths colored by sense tag. The reflexive complement is clearly exclusive to herstellen_4, direct objects tend to go with herstellen_2 and modifiers are fairly frequent.
Figure 86. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (9 tokens);
- headlines (2 tokens of herstellen_1 and herstellen_2);
- atypical contexts (2 tokens of herstellen_1 without explicit object);
- special cases (3 tokens of herstellen_3 with a material object such as schade, which could be expected to wind up between herstellen_1 and herstellen_3).
Removed tokens
1 token will be removed because it is a duplicate of another token.
HAKEN
Original senses and annotations
The tokens of haken were annotated with 5 senses in 9 batches; the tags in Table 68 were suggested.
| Definitions |
|---|
| haken_1 |
| (trans.) met of als met een haak vastmaken (aan, in, achter iets): een wagen aan een locomotief haken, een sleutel in een ring haken |
| (trans.) fix something with or as if with a hook (at, to, behind something): hook a wagon to a locomotive, a key in a key ring |
| haken_2 |
| (intrans.) met of als met een haak vastraken: de doornen haakten aan haar jas, haar paraplu bleef haken aan de deurknop |
| (intrans.) get stuck with or as if with a hook: the thorns got stuck in her coat, her umbrella got stuck in the doorknob |
| haken_3 |
| (trans.) over een uitgestoken been doen struikelen: hij werd gehaakt in de elfmeter, iemand pootje haken |
| (trans.) make trip over a stuck out leg: he was made to trip in the penalty kick, make someone trip |
| haken_4 |
| (intrans., met ‘blijven’) van gedachten, blikken e.d.: haperen, telkens terugkeren (aan of bij iets): ik bleef haken bij de herinnering aan mijn broer |
| (intrans., with blijven ‘keep’) of thoughts, gazes and such: falter, come back (to something): I kept going back to the memory of my brother |
| haken_5 |
| (intrans./trans.) zeker handwerk maken door met een staafje met een weerhaak lussen samen te weven: haken tijdens het televisiekijken, hoe ontspannend!, een babymutsje haken |
| (intrans./trans.) make handcraft by weaving loops together with a hooked needle: crocheting while watching tv, so relaxing!, crochet a baby hat |
Figure 87 shows the sense distribution by annotator and batch and Figure 88, that of the disagreements. Figure 89 shows the sense tags that each annotator of each batch assigned to the tokens with haken_1 as majority sense, Figure 90 those for haken_2, Figure 91 for haken_3, Figure 92 for haken_4 and Figure 93 for haken_5.
General distribution
The general sense distribution is quite disparate across and within batches, with a greater presence of haken_1 in batches 4 and 5, and of haken_3 in the last three batches. In 8.61% of the tokens there is no agreement; they are mostly concentrated in batches 1, 3 and 7, but the disagreement rate is in any case quite high.
A large number of tokens did not receive a majority sense among the original suggestions: 31 had no agreement at all, 54 received wrong_lemma as majority sense, 9, not_listed, and 2, unclear.
A number of the tokens without agreement could be matched to one of the original senses, but more than half were either removed, because they belonged to the wrong lemma, or were matched to the new sense tag haken_6. All of those with wrong_lemma or unclear as majority sense and some of those with not_listed were removed because they belong to a different lemma, while the rest were linked to a different sense, mainly haken_6.
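The proportions behind these counts can be checked with a short calculation. This is a minimal sketch using the counts reported above; the variable and function names are illustrative.

```python
TOTAL_TOKENS = 360  # tokens of haken, as reported in this section

# Tokens whose majority tag is not one of the original sense suggestions.
no_sense_majority = {
    "no_agreement": 31,
    "wrong_lemma": 54,
    "not_listed": 9,
    "unclear": 2,
}


def share(count, total=TOTAL_TOKENS):
    """Percentage of the lemma's tokens, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {tag: share(n) for tag, n in no_sense_majority.items()}
n_without_sense = sum(no_sense_majority.values())
overall_share = share(n_without_sense)
print(shares, n_without_sense, overall_share)
```

The share for no_agreement corresponds to the 8.61% quoted above; together the four tags account for 96 tokens, or roughly 27% of the lemma.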
Figure 87. Distribution of senses of ‘haken’ per annotator and batch.
Figure 88. Distribution of disagreeing annotations of ‘haken’ per annotator and batch.
Disagreement in haken_1
The first sense covers less than 10% of some batches and between 25% and 60% of others, with a number of alternative annotations. The disagreements tend to be focused on one annotator per batch, particularly annotator 3 of batch 8 and annotator 2 from batch 6 with their preference for haken_2.
Figure 89. Sense annotations of tokens with ‘haken_1’ as majority sense.
Disagreement in haken_2
The second sense covers less than 10% in some batches and between 20% and 30% in others, often with haken_1 as alternative annotation.
Figure 90. Sense annotations of tokens with ‘haken_2’ as majority sense.
Disagreement in haken_3
The third sense covers less than 20% in some batches and about 30% in others; it has relatively few alternative annotations, mostly for haken_2 or geen.
Figure 91. Sense annotations of tokens with ‘haken_3’ as majority sense.
Final senses
One definition was added, based on the actual occurrences of the corpus and suggestions of the annotators, so that the final senses are the ones in Table 69.
| code | Definition |
|---|---|
| haken_1 | (trans.) fix something with or as if with a hook (at, to, behind something) |
| haken_2 | (intrans.) get stuck with or as if with a hook |
| haken_3 | (trans.) make trip over a stuck out leg |
| haken_4 | (intrans., with blijven ‘keep’) of thoughts, gazes and such: falter, come back (to something) |
| haken_5 | (intrans./trans.) make handcraft by weaving loops together with a hooked needle |
| haken_6 | (with naar ‘to’) desire, aim for |
Original versus final sense distribution
Of the 360 tokens of haken, 185 kept their original majority senses, 48 were corrected to another original sense, and 109 were removed. 18 tokens were assigned a new sense.
Table 70 shows in how many tokens with each majority sense which actions were taken, and Figure 94 illustrates the frequency of the final tags. Figure 95 correlates the original majority sense and the final senses.
Figure 94. Final distribution of senses of ‘haken’.
| original | correct | majority | new | remove |
|---|---|---|---|---|
| haken_1 | 30 | 31 | 0 | 26 |
| haken_2 | 1 | 51 | 0 | 6 |
| haken_3 | 3 | 65 | 0 | 2 |
| haken_4 | 1 | 24 | 8 | 2 |
| haken_5 | 0 | 14 | 0 | 0 |
| no_agreement | 12 | 0 | 5 | 14 |
| not_listed | 1 | 0 | 5 | 3 |
| unclear | 0 | 0 | 0 | 2 |
| wrong_lemma | 0 | 0 | 0 | 54 |
Figure 95. Majority and final senses of ‘haken’.
Reliable cues
Table 71 shows the most frequent context words selected by the annotators as relevant. Table 72, Table 73, Table 74, Table 75 and Table 76 show the ranking of cues according to different attributes (type, position, path and steps) for the sense tags haken_1, haken_2, haken_3, haken_4 and haken_5.
The count only considers context words chosen by at least two annotators that also assigned the final sense. Of the 360 tokens, 106 have no cues that match these criteria. 67 have one single cue and 187 have more than one (up to 9).
Across senses
Some patterns emerge from the top lemmas selected as cues from the different senses: even the least frequent one, haken_5, has clear cues; for both intransitive senses, haken_2 and haken_4, blijven co-occurs frequently, but the rest of the lemmas differ; the literal senses haken_1 and haken_2 share elkaar and to a lesser degree aan as frequent cues; and haken_3 has its own set of football-related cues, such as strafschop and penalty, next to worden and pootje. There is even a profile for the new sense haken_6, albeit based on very few tokens, and for the removed tokens, which are normally cases of haken en ogen and afhaken.
| Rank | haken_1 | n | haken_2 | n1 | haken_3 | n2 | haken_4 | n3 | haken_5 | n4 | haken_6 | n5 | remove | n6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | aan/prep | 18 | in/prep | 25 | word/verb | 25 | blijf/verb | 14 | brei/verb | 6 | naar/prep | 3 | oog/verb | 25 |
| 2 | elkaar/pron | 9 | blijf/verb | 23 | strafschop/noun | 13 | oog/noun | 8 | naai/verb | 4 | dood/noun | 1 | en/vg | 20 |
| 3 | in/prep | 8 | elkaar/pron | 17 | strafschop_gebied/noun | 10 | aan/prep | 5 | en/vg | 3 | macht/noun | 1 | af/part | 13 |
| 4 | de/det | 6 | achter/prep | 10 | poot_DIM/noun | 7 | blik/noun | 2 | hobby/noun | 2 | naar/adj | 1 | af/adj | 7 |
| 5 | achter/prep | 3 | aan/prep | 4 | door/prep | 6 | in/prep | 2 | van/prep | 2 | roem/noun | 1 | met/prep | 7 |
| 6 | wagon_DIM/noun | 3 | de/det | 4 | penalty/noun | 6 | beeld/noun | 1 | capeje/noun | 1 | ruig/adj | 1 | af/prep | 5 |
| 7 | zijn/det | 3 | met/prep | 4 | bal/noun | 5 | bij/prep | 1 | en/of/vg | 1 | woest/adj | 1 | wat/det | 4 |
| 8 | fiets/noun | 2 | stuur/noun | 4 | foutief/adj | 5 | detail/noun | 1 | hoed_DIM/noun | 1 | | 0 | los/adj | 3 |
| 9 | hun/det | 2 | van/prep | 3 | in/prep | 4 | een/det | 1 | houd/verb | 1 | | 0 | hang/verb | 2 |
| 10 | trein_DIM/noun | 2 | het/det | 2 | de/det | 3 | ervaring/noun | 1 | kleed_DIM/noun | 1 | | 0 | aan/prep | 1 |
haken_1
Next to the list of types that were selected as cues, we can see that they mostly occur in the two or three closest slots to either side of the target, one or two steps away in the dependency path, mainly as locative complement (#T->ld:CW for the preposition, #T->ld:aan->obj1:CW and #T->ld:in->obj1:CW for the object) or direct object (#T->obj1:CW).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | aan/prep | 18 | R2 | 11 | #T->ld:CW | 19 | 1 | 45 |
| 2 | elkaar/pron | 9 | L2 | 10 | #T->obj1:CW | 18 | 2 | 33 |
| 3 | in/prep | 8 | R3 | 10 | #T->ld:aan->obj1:CW | 10 | 3 | 16 |
| 4 | de/det | 6 | L1 | 8 | #T->ld:in->obj1:CW | 4 | 4 | 4 |
| 5 | achter/prep | 3 | R1 | 8 | #T->su:CW | 4 | 6 | 1 |
| 6 | wagon_DIM/noun | 3 | L3 | 6 | #T->mod:achter->obj1:CW | 2 | 7 | 1 |
| 7 | zijn/det | 3 | R6 | 6 | #T->mod:CW | 2 | | 0 |
| 8 | fiets/noun | 2 | R4 | 5 | #T->obj1:wagon_DIM->det:CW | 2 | | 0 |
| 9 | hun/det | 2 | L5 | 4 | CW->body:#T | 2 | | 0 |
| 10 | trein_DIM/noun | 2 | L7 | 4 | #T->ld:aan->obj1:die->mod:CW | 1 | | 0 |
haken_2
Next to the list of types that were selected as cues, we can see that they mostly occur in the three closest slots to either side of the target, one or two steps away in the dependency path, mainly as locative complement (#T->ld:CW for the preposition, #T->ld:in->obj1:CW for the object) or verb of which the target is verbal complement (CW->vc:#T, filled by blijven).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | in/prep | 25 | L1 | 39 | #T->ld:CW | 30 | 1 | 70 |
| 2 | blijf/verb | 23 | L2 | 18 | CW->vc:#T | 23 | 2 | 50 |
| 3 | elkaar/pron | 17 | L3 | 15 | #T->ld:in->obj1:CW | 16 | 3 | 19 |
| 4 | achter/prep | 10 | R1 | 15 | #T->mod:CW | 10 | 4 | 8 |
| 5 | aan/prep | 4 | R3 | 12 | #T->ld:achter->obj1:CW | 7 | 5 | 4 |
| 6 | de/det | 4 | R2 | 11 | #T->su:CW | 7 | 7 | 2 |
| 7 | met/prep | 4 | L4 | 10 | #T->mod:met->obj1:CW | 6 | 8 | 2 |
| 8 | stuur/noun | 4 | L5 | 7 | #T->ld:aan->obj1:CW | 5 | 6 | 1 |
| 9 | van/prep | 3 | R4 | 7 | blijf->[vc:#T,su:CW] | 4 | NA | 1 |
| 10 | het/det | 2 | L6 | 5 | #T->su:en->cnj:CW | 3 | | 0 |
haken_3
Next to the list of types that were selected as cues, we can see that they mostly occur in the first two slots to either side of the target, up to three steps away in the dependency path, mainly as verb of which the target is a verbal complement (CW->vc:#T, mostly filled by worden) or modifier of the target (#T->mod:CW, mostly door and foutief).
The 20 cues beyond the sentence correspond to 16 tokens; in all cases there are also cues within the sentence, but those outside it help specify the context of a football match.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | word/verb | 25 | L1 | 34 | CW->vc:#T | 29 | 1 | 62 |
| 2 | strafschop/noun | 13 | L2 | 30 | NA | 20 | 2 | 47 |
| 3 | strafschop_gebied/noun | 10 | R1 | 18 | #T->mod:CW | 13 | 3 | 30 |
| 4 | poot_DIM/noun | 7 | L3 | 11 | word->[vc:#T,su:CW] | 10 | NA | 20 |
| 5 | door/prep | 6 | R2 | 9 | #T->obj1:CW | 9 | 4 | 16 |
| 6 | penalty/noun | 6 | R5 | 9 | #T->ld:in->obj1:CW | 8 | 5 | 5 |
| 7 | bal/noun | 5 | R3 | 8 | #T->mod:in->obj1:CW | 8 | 6 | 2 |
| 8 | foutief/adj | 5 | R6 | 8 | #T->su:CW | 8 | 7 | 2 |
| 9 | in/prep | 4 | R4 | 7 | ->[ROOT:#T,ROOT:sta->dp:CW] | 2 | 10 | 1 |
| 10 | de/det | 3 | L4 | 6 | #T->ld:CW | 2 | 11 | 1 |
haken_4
In addition to the list of types that were selected as cues, we can see that they mostly occur in the closest two or three slots to the left of the target, one or maybe two steps away in the dependency path, mainly as the verb of which the target is a verbal complement (CW->vc:#T, filled by blijven) but also as its subject and prepositional complements.
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | blijf/verb | 14 | L1 | 13 | CW->vc:#T | 14 | 1 | 26 |
| 2 | oog/noun | 8 | L2 | 12 | #T->ld:CW | 6 | 2 | 16 |
| 3 | aan/prep | 5 | L3 | 7 | blijf->[vc:#T,su:CW] | 6 | 3 | 5 |
| 4 | blik/noun | 2 | R1 | 5 | #T->su:CW | 5 | 4 | 2 |
| 5 | in/prep | 2 | L4 | 3 | #T->ld:aan->obj1:CW | 3 | 5 | 1 |
| 6 | beeld/noun | 1 | L5 | 3 | #T->ld:in->obj1:CW | 3 | 0 | |
| 7 | bij/prep | 1 | R3 | 2 | #T->mod:in->obj1:CW | 2 | 0 | |
| 8 | detail/noun | 1 | L10 | 1 | #T->ld:aan->obj1:brok_DIM->mod:CW | 1 | 0 | |
| 9 | een/det | 1 | L7 | 1 | #T->ld:aan->obj1:brok_DIM->mod:poëzie->mod:CW | 1 | 0 | |
| 10 | ervaring/noun | 1 | L9 | 1 | #T->ld:in->obj1:ervaring->mod:CW | 1 | 0 | |
haken_5
Next to the list of types that were selected as cues, we can see that they mostly occur in the two closest slots to either side of the target, up to two steps away in the dependency path, mainly as conjunct (en->[cnj:#T,cnj:CW]).
| Rank | cw_type | n | position | n | path | n | steps | n |
|---|---|---|---|---|---|---|---|---|
| 1 | brei/verb | 6 | L1 | 6 | en->[cnj:#T,cnj:CW] | 7 | 2 | 14 |
| 2 | naai/verb | 4 | L2 | 6 | #T->obj1:CW | 3 | 1 | 9 |
| 3 | en/vg | 3 | R2 | 5 | CW->cnj:#T | 3 | 3 | 4 |
| 4 | hobby/noun | 2 | L4 | 3 | #T->ld:aan->obj1:CW | 1 | 4 | 1 |
| 5 | van/prep | 2 | L8 | 2 | #T->mod:CW | 1 | 5 | 1 |
| 6 | capeje/noun | 1 | R1 | 2 | #T->mod:van->obj1:CW | 1 | NA | 1 |
| 7 | en/of/vg | 1 | L15 | 1 | #T->su:CW | 1 | 0 | |
| 8 | hoed_DIM/noun | 1 | L3 | 1 | ben->[vc:#T,su:CW] | 1 | 0 | |
| 9 | houd/verb | 1 | L6 | 1 | ben->predc:en->[cnj:#T,su:CW] | 1 | 0 | |
| 10 | kleed_DIM/noun | 1 | L9 | 1 | ben->vc:en->[cnj:#T,su:CW] | 1 | 0 | |
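Tables of this kind can be derived from a flat list of cue records by counting each column independently and keeping the ten most frequent values. The following is a minimal sketch of that idea, not the procedure actually used for this report: the record layout, the field names and the helper `rank_column` are assumptions, and the three records are toy data made up for illustration.

```python
from collections import Counter


def rank_column(records, key, top=10):
    """Return (value, count) pairs for one column, most frequent first.

    Missing values are counted as "NA", mirroring the NA rows in the tables.
    """
    counts = Counter(r.get(key, "NA") for r in records)
    return counts.most_common(top)


# Hypothetical cue records; the field names are illustrative only.
cues = [
    {"cw_type": "brei/verb", "position": "L2", "path": "en->[cnj:#T,cnj:CW]", "steps": 2},
    {"cw_type": "brei/verb", "position": "R2", "path": "en->[cnj:#T,cnj:CW]", "steps": 2},
    {"cw_type": "naai/verb", "position": "L1", "path": "CW->cnj:#T", "steps": 1},
]

for column in ("cw_type", "position", "path", "steps"):
    print(column, rank_column(cues, column))
```

Because each column is ranked on its own, the values in one row of such a table need not belong to the same cue, which is how the tables in this section should be read as well.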
Most frequent dependency paths
Figure 96 shows the 10 most frequent dependency paths, colored by sense tag. The passive construction prefers haken_3 or haken_4 and the direct object prefers haken_1, while the locative complement is most frequent with haken_1 and haken_2.
Figure 96. Tokens per path.
Tracking lists
For the examination of the clouds, some lists were compiled with tokens that could be interesting to track. For this lemma, these include:
- nominalizations (4 tokens, of haken_5 and haken_6);
- garden-path tokens, as is the case of (27), of haken_1 but with a human being as object;
- headlines (2 tokens, from haken_2);
- special cases (2 tokens of haken_1 that actually mean “to unhook”, 3 tokens with zich as object);
- an idiomatic expression (7 tokens of haken_1 where the object being hooked is a metaphorical wagon or similar).
(27) haak in zijn hand naar de oppervlakte werd gesleurd .
" Deze gek haakte me toen hij voor marlijn ging " , aldus de duiker . "
Removed tokens
Two tokens will be removed because no clear sense could be assigned or the sense was too infrequent, two because they are duplicates of other tokens and 107 because they do not correspond to the target lemma:
- 31 are instances of haken en ogen and a further 6 of the noun haak;
- 31 are instances of afhaken and a further 7 of other separable verbs, namely inhaken, aanhaken, doorhakken and ophaken;
- 24 are instances of vasthaken –which is a synonym of haken_1 and haken_2 and was not identified by the annotators as a separate lemma– and 6, which is an antonym and was only occasionally identified.
There used to be 13, but herkennen proved to be too messy, so it goes in storage for now.
In one case, there is a context word tagged by the annotators, contacten (what dat was referring to), that occurs both inside and outside the sentence. Because of a bug in the annotation tool, the first instance, outside the sentence, might have been tagged instead of the second one, and it might be the case that the annotators did not correct it after being warned of the bug.
One of these tokens had six “cues” beyond the sentence: the individual cues are not relevant per se, but the whole clause they form is part of the antecedent of dat, the object of the target:
door het raam zouden hebben toegekeken .
De schoonvader heeft dat later overigens herroepen . De raadsheren proberen de gangen van de vier te volgen om zich
In one case, the annotators selected the context words “Voetballer” and “sportlaureaat”, from the previous sentence, instead of cues inside the sentence of the target:
2001-01-18 jean eykmans Voetballer Bart Goor kanshebber voor titel Geelse sportlaureaat Volgend weekend huldigen de sportraden van Geel , Laakdal en Meerhout hun individuele sporters en sportverenigingen die zich vorig jaar onderscheidden in hun discipline .
One of them and the third annotator did agree on a relevant context word, namely the head of the passive subject, but the third annotator assigned the intransitive reading herstructureren_3, so they did not agree on the sense tag.